ABSTRACT


The advantages of the Internet (ease of access, limited regulation, a vast potential
audience, and the fast flow of information) have made it one of the most popular ways to
communicate and exchange ideas. Criminal and terrorist groups exploit the same
advantages, turning the Internet into a new arena for their illegal and terrorist
activities. The Internet hosts millions of Web sites in many languages, but the lack of
foreign-language search engines makes it difficult to analyze foreign-language Web sites
efficiently. This thesis enhances an open source Web
crawler with Arabic search capability, thus improving an existing social networking tool 
to perform page correlation and analysis of Arabic Web sites. A social networking tool 
with Arabic search capabilities could become a valuable tool for the intelligence 
community. Its page correlation and analysis results could be used to collect open source 
intelligence and build a network of Web sites that are related to terrorist or criminal 
activities. 
EXECUTIVE SUMMARY


After more than eight years of the War on Terrorism, Improvised Explosive
Devices (IEDs) have become the weapon of choice for terrorists in Iraq and
Afghanistan. IEDs account for the majority of casualties among Allied forces and
civilians. One reason for the proliferation of IEDs is the ease of access to training
material on the Internet. The Internet is a cheap, convenient, and powerful tool
for reaching a vast reservoir of information and knowledge. Unfortunately, it also
empowers technology-savvy terror networks and extremist groups to create IED
education networks and to distribute IED know-how to their operatives and supporters
quickly and efficiently.

One way to counter this problem is a social networking tool that applies
network theory and social network analysis to identify terrorist IED education
networks quickly. This tool would use an open source web crawler that indexes
Arabic websites into a searchable database, which can then be analyzed and queried to
collect actionable intelligence.

The Nutch project was selected as the search engine for this social
networking tool. Because Nutch exposes how it ranks pages, users can
tailor the ranking to their specific requirements. Its versatile plug-in
architecture provides extensibility, flexibility, and maintainability.

To enable Nutch to index Arabic websites, an Arabic language analyzer must
be added to Nutch’s library. Several experiments tested the performance
of the Arabic language analyzer, with moderate results.

Overall, Nutch with an added Arabic analyzer would be a valuable addition to
an existing social networking tool, enabling page correlation and analysis of Arabic
websites. Its results could be used to identify IED education networks and to collect open
source intelligence.
I. INTRODUCTION


A. MOTIVATION

Since its invention, the Internet has revolutionized communication. It enables
people to exchange ideas and share information rapidly and cheaply. Unfortunately, its
lack of regulation and pervasive communication have also turned it into a new tool for
tech-savvy terrorists: “Today, almost without exception, all major (and many minor)
terrorist and insurgent groups have web sites” [1]. Many terror organizations, such as
Al-Qaeda, actively use the Internet to recruit new members, solicit donations from
sympathizers, and spread propaganda.

They also turn the Internet into a virtual training ground, offering tutorials on
building IEDs and planning attacks. These training materials are easily accessible to
anyone with an Internet connection, which is a major contributor to the surge of
IED attacks in Iraq and Afghanistan. To counter the proliferation of IED technology,
these IED education networks need to be identified, monitored, and referred to sovereign
authorities for further action as necessary.

One possible solution to this problem is a social networking tool that applies
network science to identify IED education networks via the World Wide Web. In [2],
network science is defined as the study of networks which “contrasts, compares, and
integrates techniques and algorithms developed in disciplines as diverse as mathematics,
statistics, physics, social network analysis, information science and computer science.”
The social networking tool would incorporate an open source web crawler that could index
Arabic websites into a searchable database for analysis and querying.

B. RESEARCH OBJECTIVES

The research objectives of this thesis were to enhance a Web crawler engine with
an Arabic search capability that could index Arabic-language websites proficiently, thus
improving an existing social networking tool to perform page correlation and analysis of
Arabic websites. The enhanced Web crawler could help speed up the analytical
process of the social networking tool to effectively identify IED education networks via
the World Wide Web.

C. THESIS ORGANIZATION 

This thesis consists of six chapters. Chapter I gives an overview of the motivation,
objectives, and organization of the thesis. Chapter II briefly discusses information
retrieval, the challenges of Arabic information retrieval, stemming in Arabic,
and the light stemmer algorithm. Chapter III introduces Lucene, a scalable
Information Retrieval (IR) library; Nutch, an open source search engine; and Nutch’s
plug-in architecture. Chapter IV discusses the implementation of the
light stemmer algorithm in Lucene’s analyzer library and the development of the
ArabicAnalyzer plug-in. Chapter V compares the performance of ArabicAnalyzer
and NutchDocumentAnalyzer. Chapter VI summarizes the thesis and
recommends future research.
II. ARABIC INFORMATION RETRIEVAL


A. INFORMATION RETRIEVAL

The fast growth of the Internet, accompanied by the explosion of data available
via the World Wide Web, has made finding useful information a tedious and
difficult task. These difficulties have attracted renewed interest in Information
Retrieval and its techniques.

Information Retrieval (IR) is the science of locating relevant documents in a large
collection of documents. The retrieval process is influenced by the queries supplied by the
user, the indexing process, and the natural language that is being indexed [3].

According to [4], popular classic IR strategies include the Vector Space Model,
Probabilistic Retrieval, the Language Model, and Inference Networks.

The Vector Space Model is a widely used retrieval strategy. In this model, both
the query and each document are represented as vectors in the term space. A measure of
similarity between the two vectors is then computed.
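
As an illustration of the Vector Space Model, the following minimal Python sketch represents texts as raw term-frequency vectors and ranks documents by cosine similarity with the query. The weighting scheme (plain term counts rather than tf-idf), the whitespace tokenizer, and the sample texts are assumptions made for this example and are not taken from [4].

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse vector of raw term frequencies."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and documents, for illustration only.
query = term_vector("arabic search engine")
docs = {
    "d1": term_vector("open source search engine for arabic web sites"),
    "d2": term_vector("weather report for monday"),
}

# Rank documents by decreasing similarity to the query.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
```

Documents that share no term with the query receive a similarity of zero and fall to the bottom of the ranking.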

In the Probabilistic Retrieval model, a probability based on the likelihood that a 
term will appear in a relevant document is computed for each term in the collection. For 
terms that match between a query and a document, the similarity measure is computed as 
the combination of the probabilities of each of the matching terms [4]. 

In the Language Model, a language model is inferred for each document; then the 
probability of generating the query according to each of these models is computed. 
Documents are then ranked according to these probabilities [5]. 

Inference Networks, also known as Bayesian networks, are used to model
documents, the documents’ contents, and the query. The network then uses this
information to derive (“infer”) other relationships. The strength of this inference is
used as the similarity coefficient [4].
B. THE CHALLENGES OF ARABIC IR

According to [6], there are over 200 million native Arabic speakers in the world
and over 20 million people who speak it as a second language. Internet use in
Arabic-speaking countries has also grown rapidly. From [7], the number of Internet
users in Middle East countries alone grew from 3 million in 2000 to 58 million in
2009. The demand for Arabic IR is therefore increasing, but Arabic poses
many challenges for IR.

First, Arabic has a very complex morphology. In [8], the authors
observed:

Arabic has two genders, feminine and masculine; three numbers, singular,
dual and plural; and three grammatical cases, nominative, genitive, and
accusative. A noun has the nominative case when it is a subject,
accusative when it is the object of a verb, and genitive when it is the object
of a preposition.

Any Arabic IR system has to deal with this morphology, which compounds its
complexity.

Second, Arabic is highly ambiguous. One major contributor
to this ambiguity is that orthographic variations are widespread in Arabic [9]. The
authors gave the example that, when HAMZA is combined with ALEF (أ) or
MADDA is combined with ALEF (آ), the HAMZA or MADDA is sometimes dropped, leaving it
ambiguous whether the HAMZA or MADDA is present. Another contributor
to the ambiguity is that vowels (diacritics) are sometimes omitted in
written Arabic, which may change the meaning of the words. This uncertainty
affects the precision and recall of any Arabic IR system.

Finally, broken plurals, the plural forms of irregular nouns, are common in Arabic.
A broken plural does not resemble its singular form and does not obey
normal morphological rules. As a result, it is very difficult to design an algorithm that
transforms this kind of plural into its singular form [9].
C. RESEARCH IN ARABIC IR


Research on Arabic IR has focused on using word roots and stems as index terms.
A stem is what remains of a word after removing prefixes and suffixes. The
root, by contrast, is the origin of the word that remains after removing nonessential
characters, prefixes, and suffixes. Using word roots as index terms requires linguistic
knowledge and an understanding of the language’s morphology, whereas using stems
as index terms requires no prior knowledge of the language.
In [10], the authors recognized that “stemming is one of many tools besides
normalization that is used in information retrieval to combat the vocabulary mismatch
problem.” As discussed in Section II.B, Arabic is very difficult to stem; therefore,
only a few Arabic stemmers were available.

One of the earliest stemmers was the root-based stemmer proposed by Khoja and
Garside. This stemmer removes all stopwords, punctuation, and numbers, and then
peels away prefixes and suffixes. After that, it matches the result against a list of
patterns to extract the root. Finally, it matches the extracted root against a list of known
“valid” roots. The Khoja stemmer has a few weaknesses. First, it can produce
wrong results when removing prefixes and suffixes. It can also generate wrong roots
for words that contain EBDAL [10], [11], [12].

Buckwalter’s morphological analyzer is another useful stemmer. First, this
stemmer transliterates the Arabic word into Latin letters. Then, it segments the word into all
possible combinations of prefix, stem, and suffix. After that, it checks every candidate
segmentation against its built-in lexicons (prefix dictionary, stem dictionary, and suffix
dictionary). If all the word elements (prefix, stem, suffix) are found in their respective
lexicons, three truth tables indicating their legal combinations (prefix-suffix,
prefix-stem, and stem-suffix) are used to determine whether they are compatible. If the word
elements pass all three truth tables, the segmentation is valid. This stemmer provides highly
reliable results, but it is slow [13].

The light stemmer is another approach for Arabic IR. Most light stemmers in [8],
[14] are based on the same idea: extract stems by deleting the most frequent prefixes and
suffixes. These stemmers do not attempt to produce the Arabic root. This thesis
applies the light stemmer algorithm in [14] to give the Web crawler an Arabic
search capability. A more detailed discussion follows in the next section.

D. LIGHT STEMMER ALGORITHM

1. Introduction 

The light stemmer allows for good information retrieval results without providing
a correct morphological analysis [10]. Anyone can employ the light stemmer
algorithm, even without knowledge of the language.

2. The Algorithm 

The stemmer has two parts: normalization and stemming. The normalization
process normalizes the orthography (the writing system) of the queries and the
corpus. The stemming process then removes prefixes and suffixes using the light
stemmer algorithm to extract the stems [14].


a. Normalization

In [14], before stemming, the corpus and queries are normalized as follows:

(1) Convert to Windows Arabic encoding (CP1256).

(2) Remove punctuation.

(3) Remove diacritics (primarily weak vowels).

(4) Remove non-letters.

(5) Replace آ (ALEF with MADDA above), أ (ALEF with HAMZA above), and
    إ (ALEF with HAMZA below) with ا (ALEF).

(6) Replace final ى (ALEF MAKSURA) with ي (YEH).

(7) Replace final ة (TEH MARBUTA) with ه (HEH).
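
The normalization steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this thesis. It assumes the input is already a Unicode string, so step (1), the conversion to CP1256, is skipped; it also assumes that the diacritics of step (3) are the code points U+064B through U+0652, and it implements steps (2) and (4) together by keeping only characters whose Unicode category is a letter.

```python
import re
import unicodedata

# Assumed diacritic range: tanween, fatha, damma, kasra, shadda, sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")
# ALEF with MADDA above, ALEF with HAMZA above, ALEF with HAMZA below.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")

ALEF = "\u0627"
ALEF_MAKSURA = "\u0649"
YEH = "\u064A"
TEH_MARBUTA = "\u0629"
HEH = "\u0647"


def normalize(token):
    """Apply normalization steps (2) through (7) to a single token."""
    # Step 3: remove diacritics.
    token = DIACRITICS.sub("", token)
    # Steps 2 and 4: drop punctuation and any other non-letter character.
    token = "".join(ch for ch in token if unicodedata.category(ch).startswith("L"))
    # Step 5: fold the ALEF variants into bare ALEF.
    token = ALEF_VARIANTS.sub(ALEF, token)
    # Step 6: final ALEF MAKSURA becomes YEH.
    if token.endswith(ALEF_MAKSURA):
        token = token[:-1] + YEH
    # Step 7: final TEH MARBUTA becomes HEH.
    if token.endswith(TEH_MARBUTA):
        token = token[:-1] + HEH
    return token
```

Because every step is a simple character-level substitution, the same function can be applied to both the corpus and the queries, which is what keeps the two consistent at search time.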
b. Light Stemmers

After the corpus and queries are normalized, the stemmer is applied as
follows:

(1) Remove initial و (WAW) if the remainder of the word is three or
    more characters long.

(2) Remove any of the definite articles if this leaves two or
    more characters.

(3) Go through the list of suffixes once, in the right-to-left order
    indicated in Figure 1, removing any that are found at the
    end of the word, if this leaves two or more characters.



            Remove from front              Remove suffixes

Light1      ال، وال، بال، كال، فال            none
Light2      ال، وال، بال، كال، فال، و         none
Light3      ال، وال، بال، كال، فال، و         ه، ة
Light8      ال، وال، بال، كال، فال، و         ها، ان، ات، ون، ين، يه، ية، ه، ة، ي


Figure 1. Strings removed by light stemming. From [14]

Light1, Light2, Light3, and Light8 apply the same algorithm in the stemming
process. The difference between them is the number of prefixes and suffixes that are
removed. Light1 removes five prefixes and no suffixes. Light2
removes six prefixes and no suffixes. Light3
removes six prefixes and two suffixes. Light8 removes
six prefixes and 10 suffixes.
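
The three stemming steps can be sketched in Python as below, using the Light8 variant. This is an illustrative sketch rather than the code developed in this thesis. The affix lists are an assumption: they are the Light8 prefixes and suffixes as they are commonly listed for [14], written here as Unicode escapes with the Arabic form in a comment. The function expects a token that has already been normalized.

```python
WAW = "\u0648"  # و

# Definite-article prefixes, longest first so that "وال" is tried before "ال".
ARTICLES = [
    "\u0648\u0627\u0644",  # وال
    "\u0628\u0627\u0644",  # بال
    "\u0643\u0627\u0644",  # كال
    "\u0641\u0627\u0644",  # فال
    "\u0627\u0644",        # ال
]

# Light8 suffixes in the order they are tried (assumed from [14]).
SUFFIXES = [
    "\u0647\u0627",  # ها
    "\u0627\u0646",  # ان
    "\u0627\u062A",  # ات
    "\u0648\u0646",  # ون
    "\u064A\u0646",  # ين
    "\u064A\u0647",  # يه
    "\u064A\u0629",  # ية
    "\u0647",        # ه
    "\u0629",        # ة
    "\u064A",        # ي
]


def light_stem(word):
    """Strip frequent prefixes and suffixes from a normalized Arabic token."""
    # Step 1: remove initial WAW if at least three characters remain.
    if word.startswith(WAW) and len(word) - 1 >= 3:
        word = word[1:]
    # Step 2: remove one definite article if at least two characters remain.
    for article in ARTICLES:
        if word.startswith(article) and len(word) - len(article) >= 2:
            word = word[len(article):]
            break
    # Step 3: one pass over the suffix list, removing each suffix that
    # matches the end of the word and leaves at least two characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[:-len(suffix)]
    return word
```

The length checks are what make the stemmer "light": short words are left untouched, so the algorithm never reduces a token to fewer than two characters.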
3. Results

The authors in [14] compared the retrieval effectiveness of the light stemmer
algorithm (Light8) and of a morphological analyzer (the Khoja stemmer). “Raw” in Figure 2
means no normalization or stemming. Figure 2 shows that the light stemmer
outperforms both the Khoja stemmer and raw retrieval. Table 1 shows that the light stemmer
improved average precision by over 90% relative to raw retrieval.

The authors concluded that stemming is very effective for Arabic IR. For
monolingual retrieval, the light stemmer demonstrated an improvement of around 100%
in average precision due to stemming and related processes.



Figure 2. Monolingual 11-point precision results. From [14] 
Table 1. The uninterpolated average precision. From [14]

Stemmer          raw     khoja-u   khoja    light8
Av. Precision    .194    .313      .341     .376
Pct. Change              61.7      76.2     94.3


E. CHAPTER SUMMARY 

In this chapter, the challenges of Arabic IR and past Arabic IR research were
covered. Also discussed was the effectiveness of the light stemmer in Arabic IR. In the next
chapter, Lucene, Nutch, and Nutch’s plug-in architecture are introduced.
III. LUCENE AND NUTCH


A. INTRODUCTION

Lucene and Nutch, both created by Doug Cutting, are two open-source software
projects. According to [15], Lucene is a high-performance, scalable Information
Retrieval (IR) library that provides Java-based indexing and searching technology and
advanced analysis/tokenization capabilities. Nutch, in turn, is a search engine
built on top of Lucene. Together, they make a full-featured search engine
that offers transparency into how Web sites are ranked and an understanding of how a
large search engine works [16].

B. LUCENE

1. Overview 

Lucene is a software library that enables users to add indexing and searching
capabilities to their applications. Lucene can index and search any type of data as long as
it can be converted into a text format. This means Lucene can be used to search Web
pages, PDF files, and Microsoft® Word files, because textual information can be extracted
from them. This flexibility makes Lucene a strong foundation for a search engine.

2. Indexing Process 

Indexing is the process of converting text into an index, a data structure that
improves the speed of data retrieval operations. The index is the fundamental component
of Lucene.

From [16], to index data with Lucene, the data must first be converted into a stream of
plain text, a format that Lucene can process. Lucene then prepares the data
for indexing by breaking the stream of plain text into chunks, or tokens, and performing a
number of operations on them. For instance, the tokens could be lowercased before
indexing to make the search case-insensitive. This step is called analysis. After the
input has been analyzed, it is ready to be added to the index. The indexing process is
illustrated in Figure 3.



Figure 3. Lucene indexing architecture. From [17] 
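
The analysis-then-index flow described above can be illustrated with a small Python sketch. It does not use Lucene's API; the tokenizer (runs of word characters), the in-memory inverted index, and the sample documents are all assumptions chosen to keep the example self-contained.

```python
import re
from collections import defaultdict


def analyze(text):
    """Analysis step: split into tokens and lowercase each one."""
    return [tok.lower() for tok in re.findall(r"\w+", text)]


def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents that contain every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical documents, for illustration only.
docs = {1: "Lucene indexes text", 2: "Nutch builds on LUCENE", 3: "Web crawler"}
index = build_index(docs)
```

Because the query goes through the same `analyze` function as the documents, a search for "LUCENE" and a search for "lucene" hit the same postings, which is the case-insensitivity mentioned in the text.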

Lucene implements an innovative approach to maintaining the index—rather than 
maintaining a single index, Lucene builds multiple index segments and merges them 
periodically. Using segments allows a quick way to add new documents to the index by 
adding them to the newly created index segments and only periodically merging them 
with other existing segments. This process makes additions efficient because it 
minimizes physical index modifications. 

Some IR libraries need to index the whole corpus again when new data is added
to their index; Lucene does not, because it supports incremental indexing.
This means Lucene makes the contents of newly added documents searchable
immediately, without indexing the whole corpus again [15].
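
The idea of segment-based, incremental indexing can be sketched as follows. This is a conceptual Python illustration only; Lucene's actual segment format and merge policy are considerably more involved. The class name `SegmentedIndex` and the `merge_threshold` parameter are hypothetical.

```python
class SegmentedIndex:
    """Toy index that writes each new document to its own small segment."""

    def __init__(self, merge_threshold=3):
        self.segments = []  # each segment maps term -> set of doc ids
        self.merge_threshold = merge_threshold

    def add_document(self, doc_id, text):
        # A new document only creates a new segment; existing segments
        # are not modified, so the addition is cheap.
        segment = {}
        for term in text.lower().split():
            segment.setdefault(term, set()).add(doc_id)
        self.segments.append(segment)
        if len(self.segments) >= self.merge_threshold:
            self.merge()

    def merge(self):
        # Periodically fold all segments into one to keep searches fast.
        merged = {}
        for segment in self.segments:
            for term, ids in segment.items():
                merged.setdefault(term, set()).update(ids)
        self.segments = [merged]

    def search(self, term):
        # A search consults every segment, so new documents are visible
        # as soon as their segment exists.
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term.lower(), set())
        return sorted(hits)


idx = SegmentedIndex(merge_threshold=3)
idx.add_document(1, "arabic web")
idx.add_document(2, "arabic search")
```

At this point the index holds two segments and both documents are already searchable; adding a third document would trigger a merge into a single segment.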
3. Analyzer

As discussed above, analysis is a very important step in the indexing process. It
converts a field of text into the most fundamental indexed representation, terms. These
terms are used to determine which documents match a query during searches.

An analyzer encapsulates the analysis process. The analyzer’s job is to
turn strings of text into a stream of tokens by performing any number of operations on
them. Lucene includes several built-in analyzers that do a good job of analyzing
English-based text. For non-English languages, language-specific analyzers are needed.
Lucene’s core API provides building blocks for creating custom language analyzers.

C. NUTCH 

1. Architecture Overview 

Nutch is a complete open-source Web search engine that can operate at one of
three scales: local file system, intranet, or whole Web [15]. Nutch can be divided into
two parts: the crawler and the searcher.

From [18], the components of the crawler are WebDB, the fetch lists, the fetchers, and
the updates. WebDB is a custom database that tracks every known page and relevant link. It
maintains a small set of facts about each page, such as the last crawled date. Fetch lists
are generated from WebDB. These lists contain the URLs that users want to download.
The fetchers consume the fetch lists to produce the WebDB updates and the Web
content. The updates indicate which pages have changed since the last crawl. The content is
used for searching. The WebDB-fetch cycle is designed to repeat forever, maintaining an
up-to-date image of the Web.
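
The generate, fetch, and update cycle described above can be simulated with a short Python sketch. It is not Nutch code: the Web is replaced by an in-memory dictionary of outlinks, a dictionary lookup stands in for an HTTP request, and fetch lists are chosen in alphabetical order rather than by score. The function name `crawl` and its parameters are hypothetical.

```python
def crawl(web, seeds, depth, top_n):
    """Run up to `depth` generate/fetch/update rounds over a link graph."""
    webdb = {url: {"fetched": False} for url in seeds}  # every known page
    content = {}                                        # fetched pages
    for _ in range(depth):
        # Generate: pick up to top_n known but not yet fetched URLs.
        pending = sorted(u for u, facts in webdb.items() if not facts["fetched"])
        fetch_list = pending[:top_n]
        if not fetch_list:
            break
        for url in fetch_list:
            # Fetch: a dictionary lookup stands in for downloading the page.
            outlinks = web.get(url, [])
            content[url] = outlinks
            # Update: record the fetch and any newly discovered links.
            webdb[url]["fetched"] = True
            for link in outlinks:
                webdb.setdefault(link, {"fetched": False})
    return webdb, content


# Hypothetical link graph, for illustration only.
web = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["a"]}
webdb, content = crawl(web, seeds=["a"], depth=2, top_n=1)
```

With a depth of two and one page per round, only "a" and "b" are fetched, while "c" and "d" are known to the database but still waiting for a later round.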

Once the Web content is produced, Nutch can prepare to process queries using
the searchers. First, the indexer processes all the terms and pages of the Web content into an
inverted index. The document set is divided into a set of index segments, each of which
is fed into a single searcher process. Each searcher also draws upon the Web content
fetched earlier to provide a cached copy of any Web page. Finally, a pool of Web servers
handles the interaction with users and contacts the searchers for results. A generic overview of
Nutch’s architecture is shown in Figure 4.



Figure 4. Nutch’s architecture. From [18] 


2. Plug-In Architecture

Nutch’s plug-in system is based on the Eclipse 2.0 plug-in architecture. It
provides a core service for controlling a set of tools working together to support
programming tasks. After reviewing Eclipse’s architecture from [19] and applying it to
Nutch’s plug-in system, we observe that the three most important components of Nutch’s
plug-in system are Extension, ExtensionPoints, and Plug-in. The Extension class provides
a way to add new functions to a plug-in. It is defined by a plug-in that wants to
contribute its functionality to another plug-in. ExtensionPoints define an interface that must
be implemented by the Extension. A plug-in, a pluggable component, defines a number of
extension points that allow it to be augmented by different kinds of extensions.

This system is the mechanism of Nutch’s extensibility. Users can contribute to 
the Nutch platform by wrapping their tools in plug-ins. The new plug-ins can add new 
processing elements to existing plug-ins, and Nutch provides a set of core plug-ins to 
assist the process. 
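
The relationship between extension points, extensions, and the plug-in registry can be sketched in Python as follows. This is a conceptual illustration only and does not mirror Nutch's or Eclipse's classes; `AnalyzerExtensionPoint`, `PluginRegistry`, and `WhitespaceAnalyzer` are hypothetical names.

```python
from abc import ABC, abstractmethod


class AnalyzerExtensionPoint(ABC):
    """Extension point: the interface every analyzer extension implements."""

    @abstractmethod
    def analyze(self, text):
        ...


class PluginRegistry:
    """Keeps track of which extensions are attached to which extension point."""

    def __init__(self):
        self._extensions = {}  # extension point class -> list of extensions

    def register(self, point, extension):
        # An extension is accepted only if it implements the point's interface.
        if not isinstance(extension, point):
            raise TypeError("extension does not implement the extension point")
        self._extensions.setdefault(point, []).append(extension)

    def extensions(self, point):
        return list(self._extensions.get(point, []))


class WhitespaceAnalyzer(AnalyzerExtensionPoint):
    """A trivial extension contributed by some plug-in."""

    def analyze(self, text):
        return text.split()


registry = PluginRegistry()
registry.register(AnalyzerExtensionPoint, WhitespaceAnalyzer())
```

The host code only ever asks the registry for extensions of a given point, so new behavior can be added without changing the host.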
D. CHAPTER SUMMARY

In this chapter, Lucene’s indexing process and analyzers were
examined. Nutch’s architecture and its plug-in system were also
studied. In the next chapter, the implementation of the light stemmer algorithm
in Nutch is discussed.
IV. ARABICANALYZER PLUG-IN DEVELOPMENT


A. INTRODUCTION

When Nutch finishes fetching a segment of Web sites, the language-identifier
plug-in is called to identify the language of the Web sites and attach a language code to
them. After that, the AnalyzerFactory instantiates a NutchAnalyzer, the extension point
through which an analyzer is associated with a specific language code. The
NutchAnalyzer extension point is an abstract class that extends the Lucene Analyzer
class, so that Lucene analyzers can be easily integrated as NutchAnalyzer plug-ins. The
policy of the AnalyzerFactory for finding the NutchAnalyzer extension to use is to return
the first one that matches the specified language code. If none is found, the default
NutchDocumentAnalyzer is used. After the AnalyzerFactory identifies the right analyzer
based on the language code, the NutchAnalyzer calls the corresponding analyzer, in this case
ArabicAnalyzer, from the Lucene analyzer library to index the Web site. The process of
indexing a Web site is shown in Figure 5.



Figure 5. The process of indexing a Web site 
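
The selection policy described above (first match on the language code, otherwise the default analyzer) can be sketched in Python. The classes below are stand-ins written for this illustration; they share names with the Nutch components they imitate but do not reproduce Nutch's API.

```python
class Analyzer:
    """Stand-in for an analyzer; only its name matters in this sketch."""

    def __init__(self, name):
        self.name = name


class AnalyzerFactory:
    """Returns the first registered analyzer matching a language code."""

    def __init__(self, default):
        self._default = default
        self._registered = []  # (language code, analyzer) in registration order

    def register(self, lang, analyzer):
        self._registered.append((lang, analyzer))

    def get(self, lang):
        for code, analyzer in self._registered:
            if code == lang:
                return analyzer  # first match wins
        return self._default     # no match: fall back to the default


factory = AnalyzerFactory(default=Analyzer("NutchDocumentAnalyzer"))
factory.register("ar", Analyzer("ArabicAnalyzer"))
# A second registration for the same code is never selected.
factory.register("ar", Analyzer("AnotherArabicAnalyzer"))
```

A page tagged "ar" is therefore routed to the Arabic analyzer, while a page with any other code, or with no code at all, is indexed with the default analyzer.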
B. REQUIREMENT

To give Nutch an Arabic search capability, several tasks need to
be completed. First, the Lucene analysis library needs to be updated with an
ArabicAnalyzer that implements the light stemming algorithm. Second, an
ArabicAnalyzer plug-in is needed so that Nutch can access the Lucene analysis
library. Finally, an Arabic Ngram profile is needed to train Nutch to recognize
Arabic text.

C. DEVELOPMENT PROCESS 

1. Implementation of the Light Stemmer Algorithm 

As stated above, the Lucene analysis library needs to be updated with the
ArabicAnalyzer, which implements the light stemmer algorithm. The analysis package
contains three primary files: ArabicAnalyzer, ArabicNormalizationFilter, and
ArabicStemFilter.

The ArabicAnalyzer first creates a list of Arabic stop words based on the
stoplist from http://members.unine.ch/jacques.savoy/clef/index.html. It uses the standard
StopFilter to filter out all the stop words from the token stream. The result is then fed into
ArabicNormalizationFilter, which normalizes the orthography. The final result is then
fed into the ArabicStemFilter, which stems the token stream using the light stemmer
algorithm.
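
The order in which the three filters are chained can be illustrated with Python generators. This sketch is deliberately simplified and is not the Lucene code: the normalization filter only folds the ALEF variants, the stem filter only strips the definite article ال, and the stop word set passed in is a one-word placeholder rather than the stoplist cited above.

```python
def stop_filter(tokens, stop_words):
    """Drop every token that appears in the stop word set."""
    return (t for t in tokens if t not in stop_words)


def normalization_filter(tokens):
    """Simplified normalization: fold ALEF variants into bare ALEF."""
    table = str.maketrans({"\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627"})
    return (t.translate(table) for t in tokens)


def stem_filter(tokens):
    """Simplified stemming: strip the definite article if two letters remain."""
    article = "\u0627\u0644"  # ال
    for t in tokens:
        if t.startswith(article) and len(t) - len(article) >= 2:
            yield t[len(article):]
        else:
            yield t


def arabic_analyze(text, stop_words):
    """Chain the filters in the order: stop words, normalization, stemming."""
    tokens = text.split()
    return list(stem_filter(normalization_filter(stop_filter(tokens, stop_words))))
```

Each filter consumes the output of the previous one, so a token removed by the stop filter never reaches the normalization or stemming stages.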

2. Development of ArabicAnalyzer Plug-in

The host plug-in is the ArabicAnalyzer class in Nutch. The NutchAnalyzer, a
built-in Nutch extension point, defines the interface that must be implemented by
Nutch’s ArabicAnalyzer. The extender plug-in is the ArabicAnalyzer from Lucene’s
analysis library, which extends the functions of Nutch’s ArabicAnalyzer; in this case,
Lucene’s ArabicAnalyzer enables Nutch’s ArabicAnalyzer to index Arabic text.
In essence, Nutch’s ArabicAnalyzer plug-in is a wrapper that sets the stage and makes
it possible to run Lucene’s ArabicAnalyzer. The ArabicAnalyzer plug-in architecture,
derived from [19], is shown in Figure 6.


Host plug-in
    Plug-in class: ArabicAnalyzer
    Plug-in id: org.apache.nutch.analysis.ar
    Extension point: NutchAnalyzer

Extender plug-in
    Plug-in class: Lib.lucene.analyzer
    Extension: Analyzer
    Class: ArabicAnalyzer
    Plug-in id: org.apache.lucene.analysis.ar


Figure 6. ArabicAnalyzer plug-in architecture. From [19]

3. Creating Arabic Ngram profile

Nutch uses the language-identifier plug-in in the standard Nutch library to create an
Arabic profile based on the “1000 most frequent words” list by Jacques Savoy from the Web
site http://members.unine.ch/jacques.savoy/clef/index.html. This profile trains Nutch to
“recognize” Arabic Web sites so that it can invoke the right analyzer to index those Web
sites.
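
To show how an n-gram profile can be used to recognize a language, the following Python sketch builds ranked character n-gram profiles and compares them with an out-of-place distance. This is a generic illustration of profile-based language identification, not the algorithm or file format of Nutch's language-identifier plug-in. The n-gram lengths, the `MAX_PROFILE` cap, and the tiny training texts are assumptions made for the example.

```python
from collections import Counter

# Assumed cap on profile size; also used as the penalty for a missing n-gram.
MAX_PROFILE = 300


def ngram_profile(text, n_min=1, n_max=3):
    """Return the most frequent character n-grams of a text, ranked."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(MAX_PROFILE)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from the language cost MAX_PROFILE."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    return sum(
        abs(r - rank[gram]) if gram in rank else MAX_PROFILE
        for r, gram in enumerate(doc_profile)
    )


def identify(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical, very small training texts; a real profile needs far more data.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "ar": ngram_profile("في هذا الكتاب من اللغة العربية على الانترنت"),
}
```

With these profiles, a short Arabic phrase shares most of its n-grams with the "ar" profile and almost none with the "en" profile, so it is assigned the code "ar".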
D. CHAPTER SUMMARY

In this chapter, the ArabicAnalyzer plug-in development process was discussed.
Lucene’s analyzer library was enhanced with the ArabicAnalyzer, which implements the light
stemming algorithm. Nutch’s plug-in architecture was used to create the
ArabicAnalyzer plug-in. The plug-in enables the Nutch search engine to index
Arabic-language Web sites using the ArabicAnalyzer in Lucene’s analyzer library. In the
next chapter, the performance of ArabicAnalyzer and NutchDocumentAnalyzer is
compared.
V. EXPERIMENTAL SETUP


A. PROBLEM STATEMENT

These experiments compare the results Nutch produces when it uses the default
NutchDocumentAnalyzer with the results it produces when it uses ArabicAnalyzer to
analyze the same Web site.

The NutchDocumentAnalyzer separates the stream of tokens into individual terms
without applying any filter. For example, the token stream “hello world” becomes
“hello” “world” after NutchDocumentAnalyzer processes it. This study uses
NutchDocumentAnalyzer’s index result as a baseline, because no term is discarded during
indexing when using NutchDocumentAnalyzer [15].

The ArabicAnalyzer, on the other hand, applies several filters when analyzing the
stream of tokens. First, the token stream goes to StopFilter, which removes all the stop
words in the custom-built stop word list. The result is then filtered by
ArabicNormalizationFilter to normalize the orthography. After that, the result is
filtered by ArabicStemFilter, which applies the light stemming algorithm. The final
result is then stored in the index database.

B. HARDWARE AND SOFTWARE CONFIGURATION 

The platform used to conduct the experiments was a single Dell XPS M1530
laptop personal computer. This machine had an Intel Core 2 Duo T9300 CPU at 2.5 GHz
with 4 GB of RAM and a 185 GB hard disk. The operating system was Microsoft
Windows Vista Home Premium with Service Pack 2.

Nutch 1.0 and Lucene 2.4.0 were used to implement the light stemmer algorithm
and to run all the experiments.
C. METHODOLOGY

Three experiments were run to collect data. The first experiment used Nutch to
crawl eight Web sites with a depth of five and a topN of 50. TopN determines the
maximum number of pages that are retrieved at each level up to the depth. The Web sites
are alarabiya.net, aljazeera.net, alriyadh.com, addustour.com, aawsat.com,
bbc.co.uk/arabic/, arabic.cnn.com, and america.gov/ar/. Nutch indexed only the Web
pages within these sites, once with ArabicAnalyzer and once with NutchDocumentAnalyzer.

The second experiment computed the average crawl time and its standard deviation.
The crawler was set to crawl four of the eight Web sites above 25 times each.
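
The measurement in the second experiment amounts to timing repeated runs and summarizing them. A minimal Python sketch of that procedure is given below; it is not the script used in this thesis. The helper names are hypothetical, the task being timed is a placeholder for a crawl, and the use of the sample standard deviation (rather than the population form) is an assumption.

```python
import statistics
import time


def time_runs(task, runs):
    """Run `task` the given number of times and return each duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean, sample standard deviation) of a list of durations."""
    return statistics.mean(durations), statistics.stdev(durations)


# Placeholder task standing in for one crawl of a Web site.
durations = time_runs(lambda: sum(range(1000)), runs=25)
mean_time, std_time = summarize(durations)
```

Repeating each crawl 25 times, as in the experiment, gives enough samples for the standard deviation to indicate how much the crawl time fluctuates from run to run.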

The third experiment compared the rankings of the top 10 pages returned by the two
analyzers when searching for three different Arabic terms.

To disable ArabicAnalyzer, the following property was added to
the nutch-site.xml file in the conf folder. Because it does not list any analyzer
plug-in, the AnalyzerFactory is forced to use
NutchDocumentAnalyzer to index these sites:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|language-identifier</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  </description>
</property>

To enable ArabicAnalyzer, the following code replaces the above code within the 
nutch-site.xml file. With ArabicAnalyzer on, the AnalyzerFactory uses it to index these 
sites: 

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|language-identifier|analysis-ar</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  </description>
</property>

D. RESULTS AND DISCUSSION

1. Terms Count 

The first experiment shows that Nutch needs roughly 19% to 38% fewer terms to index
the same number of documents from the same Web site when it uses ArabicAnalyzer.
The result also suggests that searching the ArabicAnalyzer index is more efficient,
because fewer terms must be searched to locate the relevant documents. See
Table 2 for the detailed breakdown of each Web site.


Table 2. Term counts per Web site

Web site             NutchDocumentAnalyzer    ArabicAnalyzer
                     (terms count)            (terms count)

arabic.cnn.com       24776                    15827
alarabiya.net        21140                    15806
alriyadh.com         20898                    13163
aljazeera.net        18096                    13658
bbc.co.uk/arabic/    16061                    9957
america.gov/ar/      11435                    7958
addustour.com        2888                     2075
aawsat.com           1050                     847
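
The reduction in term count for each Web site can be recomputed from Table 2 with a few lines of Python. The sketch below only restates the table's figures and applies the usual percentage-change formula; the variable and function names are chosen for this example.

```python
# (NutchDocumentAnalyzer terms, ArabicAnalyzer terms) from Table 2.
TERM_COUNTS = {
    "arabic.cnn.com": (24776, 15827),
    "alarabiya.net": (21140, 15806),
    "alriyadh.com": (20898, 13163),
    "aljazeera.net": (18096, 13658),
    "bbc.co.uk/arabic/": (16061, 9957),
    "america.gov/ar/": (11435, 7958),
    "addustour.com": (2888, 2075),
    "aawsat.com": (1050, 847),
}


def reduction_pct(baseline, stemmed):
    """Percentage of terms saved relative to the baseline analyzer."""
    return 100.0 * (baseline - stemmed) / baseline


reductions = {site: reduction_pct(*counts) for site, counts in TERM_COUNTS.items()}
smallest = min(reductions, key=reductions.get)
largest = max(reductions, key=reductions.get)
```

Sorting the `reductions` dictionary by value shows which sites benefit most from stemming; in this data the smallest saving is for aawsat.com and the largest for bbc.co.uk/arabic/.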









2. Crawl Time

The second experiment shows that Nutch takes longer to index the same Web
sites when it uses ArabicAnalyzer. This result is expected: ArabicAnalyzer applies more
filters, so it requires more processing power and time to index Web sites.

The results, shown in Tables 3 to 6, also indicate that the crawl times
fluctuated more when Nutch used ArabicAnalyzer.


Table 3. Average crawl time of www.america.gov/ar/

                         Average crawl time (sec)    Standard deviation (sec)
NutchDocumentAnalyzer    362.92                      15.7
ArabicAnalyzer           375.2                       25.32


Table 4. Average crawl time of www.bbc.co.uk/arabic/

                         Average crawl time (sec)    Standard deviation (sec)
NutchDocumentAnalyzer    482.76                      5.95
ArabicAnalyzer           546.64                      37.05


Table 5. Average crawl time of www.addustour.com

                         Average crawl time (sec)    Standard deviation (sec)
NutchDocumentAnalyzer    104.56                      1.67
ArabicAnalyzer           105.12                      2.38


Table 6. Average crawl time of www.aawsat.com

                         Average crawl time (sec)    Standard deviation (sec)
NutchDocumentAnalyzer    69.56                       2.52
ArabicAnalyzer           70.2                        2.84
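The two observations about crawl time (longer on average, and more variable) can be quantified from Tables 3 to 6. The Python sketch below computes the relative slowdown of ArabicAnalyzer and the coefficient of variation (standard deviation divided by mean) for each analyzer; the names used are illustrative and only the table values are taken from the experiments.

```python
# (mean seconds, standard deviation seconds) from Tables 3 to 6.
CRAWL_TIMES = {
    "www.america.gov/ar/": {"nutch": (362.92, 15.7), "arabic": (375.2, 25.32)},
    "www.bbc.co.uk/arabic/": {"nutch": (482.76, 5.95), "arabic": (546.64, 37.05)},
    "www.addustour.com": {"nutch": (104.56, 1.67), "arabic": (105.12, 2.38)},
    "www.aawsat.com": {"nutch": (69.56, 2.52), "arabic": (70.2, 2.84)},
}


def overhead(site: str) -> float:
    """Relative increase in mean crawl time when ArabicAnalyzer is used."""
    nutch_mean = CRAWL_TIMES[site]["nutch"][0]
    arabic_mean = CRAWL_TIMES[site]["arabic"][0]
    return (arabic_mean - nutch_mean) / nutch_mean


def coefficient_of_variation(site: str, analyzer: str) -> float:
    """Standard deviation divided by mean, a unit-free measure of fluctuation."""
    mean, std = CRAWL_TIMES[site][analyzer]
    return std / mean


if __name__ == "__main__":
    for site in CRAWL_TIMES:
        print(f"{site:24s} overhead={overhead(site):6.2%} "
              f"cv_nutch={coefficient_of_variation(site, 'nutch'):.4f} "
              f"cv_arabic={coefficient_of_variation(site, 'arabic'):.4f}")
```

On these figures the slowdown is largest for www.bbc.co.uk/arabic/ (about 13%) and below 4% for the other three sites, while the coefficient of variation is higher with ArabicAnalyzer on every site.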
3. Search Results


For the third experiment, the index database of the Web site www.america.gov/ar/
is used to collect search results data. The terms shown in Table 7 are used for the search.


Table 7. Search terms

Normal Form      Light Stemmer Form    Meaning
[Arabic text]    [Arabic text]         Economy
[Arabic text]    [Arabic text]         The United States
[Arabic text]    [Arabic text]         Democratic


The Light Stemmer forms are searched using the ArabicAnalyzer's index database
and the Normal forms are searched using the NutchDocumentAnalyzer's index database.
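The relationship between a normal form and its light stemmer form can be illustrated with a simplified Python sketch. The prefix and suffix lists, the minimum stem length, and the sample words below are assumptions chosen for illustration in the spirit of published Arabic light stemmers; they are not the exact rules of the ArabicAnalyzer plug-in, the character normalization step is omitted, and the sample words are not the entries of Table 7.

```python
# Illustrative affix lists (assumption): definite-article and conjunction
# prefixes, and common inflectional suffixes. Longer prefixes come first.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]
MIN_STEM_LENGTH = 2  # hypothetical lower bound on what must remain


def light_stem(word: str) -> str:
    """Strip at most one prefix, then any listed suffixes, keeping a minimum stem."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[: -len(suffix)]
    return word


if __name__ == "__main__":
    # Sample words chosen for illustration only.
    for sample in ("الاقتصاد", "الديمقراطية", "في"):
        print(sample, "->", light_stem(sample))
```

Because several surface forms collapse to one stem, an index built this way holds fewer distinct terms, which is consistent with the term counts reported in the first experiment.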

When comparing the top 10 pages for the search term “Economy,” seven of the
pages are the same; for the search term “The United States,” all top 10 pages are the
same; and for the search term “Democratic,” six pages are the same but are ranked
differently. In all three cases, the search results from NutchDocumentAnalyzer have
higher ranking scores than the search results from ArabicAnalyzer.

Judging by the titles of the search results, their contents are related to
the search terms. Both analyzers also return results that are highly relevant
to the search terms. See Tables 8 through 13 for the breakdown.
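The comparison of the two result lists can be made reproducible. As a minimal Python sketch, the lists below are the top 10 pages for the term “Economy” as reported in Tables 8 and 9 (with the www.america.gov/ar/ prefix dropped and extraction damage in the URLs repaired); the helper names are illustrative.

```python
# Top 10 pages for "Economy", paths relative to www.america.gov/ar/.
ARABIC_TOP10 = [
    "econ.html",
    "publications/books/outline-of-the-us-economy.html",
    "econ/business.html",
    "reviving_trade_ar.html",
    "publications/books.html#outline_economy",
    "",  # the site root, www.america.gov/ar/
    "multimedia/photogallery.html",
    "publications/books.html",
    "publications/ejournalusa/1209.html",
    "index.html",
]
NUTCH_TOP10 = [
    "econ.html",
    "publications/books/outline-of-the-us-economy.html",
    "econ/business.html",
    "publications/books.html#outline_economy",
    "",
    "multimedia/photogallery.html",
    "index.html",
    "world/europe.html",
    "world/mideast.html",
    "world/scasia.html",
]


def shared_pages(first, second):
    """Pages that appear in both lists, in the order of the first list."""
    second_set = set(second)
    return [page for page in first if page in second_set]


def identical_prefix_length(first, second):
    """Number of leading positions at which both lists hold the same page."""
    count = 0
    for a, b in zip(first, second):
        if a != b:
            break
        count += 1
    return count


if __name__ == "__main__":
    print(len(shared_pages(ARABIC_TOP10, NUTCH_TOP10)), "pages in common")
    print(identical_prefix_length(ARABIC_TOP10, NUTCH_TOP10), "leading ranks identical")
```

For these two lists, seven pages are shared and the first three ranks are identical.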
Table 8. Search results of term “Economy” using ArabicAnalyzer

Top 10 pages using ArabicAnalyzer                                       Score for Query
www.america.gov/ar/econ.html                                            0.3486507
www.america.gov/ar/publications/books/outline-of-the-us-economy.html    0.12422927
www.america.gov/ar/econ/business.html                                   0.09217107
www.america.gov/ar/reviving_trade_ar.html                               0.033118278
www.america.gov/ar/publications/books.html#outline_economy              0.016127191
http://www.america.gov/ar/                                              0.003058498
http://www.america.gov/ar/multimedia/photogallery.html                  6.69E-04
www.america.gov/ar/publications/books.html                              6.47E-04
www.america.gov/ar/publications/ejournalusa/1209.html                   5.85E-04
www.america.gov/ar/index.html                                           5.73E-04
Table 9. Search results of term “Economy” using NutchDocumentAnalyzer

Top 10 pages using NutchDocumentAnalyzer                                Score for Query
www.america.gov/ar/econ.html                                            0.38501537
www.america.gov/ar/publications/books/outline-of-the-us-economy.html    0.13663794
www.america.gov/ar/econ/business.html                                   0.09989148
www.america.gov/ar/publications/books.html#outline_economy              0.01747075
www.america.gov/ar/                                                     0.002472951
www.america.gov/ar/multimedia/photogallery.html                         5.41E-04
www.america.gov/ar/index.html                                           4.64E-04
www.america.gov/ar/world/europe.html                                    4.64E-04
www.america.gov/ar/world/mideast.html                                   4.64E-04
www.america.gov/ar/world/scasia.html                                    4.02E-04






Table 10. Search results of term “The United States” using ArabicAnalyzer

Top 10 pages using ArabicAnalyzer                                              Score for Query
www.america.gov/ar/pages/footer/local/about-us.html                            0.1196895
www.america.gov/ar/publications/books-content/musliminamerica.html             0.11078926
www.america.gov/ar/amlife.html                                                 0.105654
www.america.gov/ar/services/mobile.html                                        0.042377986
www.america.gov/ar/multimedia/photogallery.html#/4110/mosques_ar/              0.022628564
www.america.gov/ar/publications/books.html#beingmuslim                         0.015091554
www.america.gov/ar/multimedia/photogallery.html#/4110/religious_freedom_ar/    0.01136245
www.america.gov/ar/publications/books.html#governed                            0.01132718
www.america.gov/ar/multimedia/photogallery.html#/4110/islam_ar/                0.011314282
www.america.gov/ar/                                                            0.003082759
Table 11. Search results of term “The United States” using NutchDocumentAnalyzer

Top 10 pages using NutchDocumentAnalyzer                                       Score for Query
www.america.gov/ar/pages/footer/local/about-us.html                            0.11997691
www.america.gov/ar/publications/books-content/musliminamerica.html             0.11105819
www.america.gov/ar/amlife.html                                                 0.10594571
www.america.gov/ar/services/mobile.html                                        0.042462345
www.america.gov/ar/multimedia/photogallery.html#/4110/mosques_ar/              0.022691099
www.america.gov/ar/publications/books.html#beingmuslim                         0.01513323
www.america.gov/ar/multimedia/photogallery.html#/4110/religious_freedom_ar/    0.011393607
www.america.gov/ar/publications/books.html#governed                            0.011358418
www.america.gov/ar/multimedia/photogallery.html#/4110/islam_ar/                0.01134555
www.america.gov/ar/                                                            0.002807639
Table 12. Search results of term “Democratic” using ArabicAnalyzer

Top 10 pages using ArabicAnalyzer                        Score for Query
www.america.gov/ar/global/democracy.html                 0.2665834
www.america.gov/ar/global.html                           0.16062789
www.america.gov/ar/publications/ejournalusa/608.html     0.033635326
www.america.gov/ar/publications/ejournalusa/0110.html    0.030587077
www.america.gov/ar/democracy/global/index.html           0.027611194
www.america.gov/ar/                                      0.002160408
www.america.gov/ar/multimedia/podcast.html               6.10E-04
www.america.gov/ar/publications/books.html               5.85E-04
www.america.gov/ar/amlife.html                           5.46E-04
www.america.gov/ar/publications/ejournalusa.html         5.40E-04
Table 13. Search results of term “Democratic” using NutchDocumentAnalyzer

Top 10 pages using NutchDocumentAnalyzer                 Score for Query
www.america.gov/ar/global/democracy.html                 0.29354417
www.america.gov/ar/global.html                           0.17680001
www.america.gov/ar/publications/ejournalusa/0110.html    0.033559922
www.america.gov/ar/democracy/global/index.html           0.030395675
www.america.gov/ar/                                      0.002139257
www.america.gov/ar/multimedia/podcast.html               6.04E-04
http://www.america.gov/ar/amlife.html                    5.40E-04
www.america.gov/ar/amlife/people.html                    4.68E-04
www.america.gov/ar/econ.html                             4.68E-04
www.america.gov/ar/multimedia.html                       4.68E-04
A more detailed breakdown of the scores of the top 10 pages using ArabicAnalyzer
and NutchDocumentAnalyzer is shown in Appendices A through F.
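The score breakdowns in the appendices can be recomputed from their leaf values. The Python sketch below assumes the classic Lucene TF-IDF similarity, in which tf is the square root of the term frequency, idf is 1 + ln(numDocs / (docFreq + 1)), each field contributes queryWeight times fieldWeight, and the page score is the sum over the matching fields. The input numbers are those listed for Page 1 of Appendix A (term “Economy”, ArabicAnalyzer); the function names are illustrative.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency (assumed classic Lucene form)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def tf(term_freq: int) -> float:
    """Term-frequency factor (assumed classic Lucene form)."""
    return math.sqrt(term_freq)


def field_score(term_freq, doc_freq, num_docs, field_norm, query_norm, boost=1.0):
    """queryWeight * fieldWeight for one field of one document."""
    field_idf = idf(doc_freq, num_docs)
    query_weight = boost * field_idf * query_norm
    field_weight = tf(term_freq) * field_idf * field_norm
    return query_weight * field_weight


# Values reported for Page 1 of Appendix A.
QUERY_NORM = 0.035326175
NUM_DOCS = 130

anchor = field_score(1, 5, NUM_DOCS, 0.15625, QUERY_NORM, boost=2.0)
content = field_score(6, 121, NUM_DOCS, 0.0068359375, QUERY_NORM)
title = field_score(2, 3, NUM_DOCS, 0.109375, QUERY_NORM, boost=1.5)
total = anchor + content + title

if __name__ == "__main__":
    print(f"anchor={anchor:.8f} content={content:.8e} title={title:.8f}")
    print(f"total={total:.7f}")
```

Under these assumptions the total agrees with the score of 0.3486507 reported for the first page in Table 8 to within rounding. The breakdown also shows that the anchor and title fields, which carry boosts of 2.0 and 1.5, dominate the score, while the content field contributes less than 0.001.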

E. CHAPTER SUMMARY 

In this chapter, the results of several experiments to compare the performance of 
ArabicAnalyzer and NutchDocumentAnalyzer were described. In the next chapter, the 
thesis summary and future work recommendations are discussed. 
VI. CONCLUSION


A. SUMMARY

Arabic IR is a challenging problem because of the complexity of the Arabic
language. Even though the light stemmer algorithm is not a perfect solution for the
Arabic IR problem, it showed improvement over other popular methods. The
ArabicAnalyzer plug-in inherits the strengths and weaknesses of the light stemmer
algorithm. It is also not perfect, but it shows great promise in saving storage
overhead.

The experiments completed in this thesis showed that there are advantages and
disadvantages to implementing the ArabicAnalyzer plug-in. The data show that, in
general, the ArabicAnalyzer plug-in performed as well as the default setting: the
query results were relevant to the search terms. The plug-in ran slower than the
default setting, but the speed issue can be overlooked because the data that this
research was trying to gather did not have to be collected in real time. On the
other hand, the ArabicAnalyzer plug-in indexed roughly 20% to 37% fewer terms than
the default setting, so its index database should require less storage; these savings
could become a major plus when indexing the Internet.

B. FUTURE WORK

For future research, the plug-in needs to be integrated into the social networking
tool, and experiments need to be conducted to determine the recall, precision, and
relevance of the plug-in in the integrated environment. The experiments should also
help determine the strengths and weaknesses of the plug-in in such environments and
recommend improvements.
APPENDIX A


This is the detailed score for query of the top 10 pages using ArabicAnalyzer.

Search Term: [Arabic term] (economy)

Page 1:

• boost = 0.22821301
• digest = 767d250a62c827c2bd330e0674546358
• lang = ar
• segment = 20100305180909
• title = [Arabic text] - America.gov
• tstamp = 20100305230954510
• url = http://www.america.gov/ar/econ.html

score for query: [Arabic term]

• 0.3486507 = (MATCH) sum of:
  o 0.18338637 = (MATCH) weight(anchor:[Arabic term]^2.0 in 15), product of:
    ■ 0.2879631 = queryWeight(anchor:[Arabic term]^2.0), product of:
      ■ 2.0 = boost
      ■ 4.075775 = idf(docFreq=5, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.63683987 = (MATCH) fieldWeight(anchor:[Arabic term] in 15), product of:
      ■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)
      ■ 4.075775 = idf(docFreq=5, numDocs=130)
      ■ 0.15625 = fieldNorm(field=anchor, doc=15)
  o 6.6904654E-4 = (MATCH) weight(content:[Arabic term] in 15), product of:
    ■ 0.03756986 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.017808065 = (MATCH) fieldWeight(content:[Arabic term] in 15), product of:
      ■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.0068359375 = fieldNorm(field=content, doc=15)
  o 0.16459529 = (MATCH) weight(title:[Arabic term]^1.5 in 15), product of:
    ■ 0.23745762 = queryWeight(title:[Arabic term]^1.5), product of:
      ■ 1.5 = boost
      ■ 4.4812403 = idf(docFreq=3, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.6931565 = (MATCH) fieldWeight(title:[Arabic term] in 15), product of:
      ■ 1.4142135 = tf(termFreq(title:[Arabic term])=2)
      ■ 4.4812403 = idf(docFreq=3, numDocs=130)
      ■ 0.109375 = fieldNorm(field=title, doc=15)

Page 2:

• boost = 0.16124225
• digest = fdaa17fd08dfde3bb91a83a6d98afa04
• lang = ar
• segment = 20100305180909
• title = [Arabic text] - Outline of the U.S. Economy - America.gov
• tstamp = 20100305230918398
• url = http://www.america.gov/ar/publications/books/outline-of-the-us-economy.html

score for query: [Arabic term]

• 0.12422927 = (MATCH) sum of:
  o 0.07335455 = (MATCH) weight(anchor:[Arabic term]^2.0 in 84), product of:
    ■ 0.2879631 = queryWeight(anchor:[Arabic term]^2.0), product of:
      ■ 2.0 = boost
      ■ 4.075775 = idf(docFreq=5, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.25473595 = (MATCH) fieldWeight(anchor:[Arabic term] in 84), product of:
      ■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)
      ■ 4.075775 = idf(docFreq=5, numDocs=130)
      ■ 0.0625 = fieldNorm(field=anchor, doc=84)
  o 9.948079E-4 = (MATCH) weight(content:[Arabic term] in 84), product of:
    ■ 0.03756986 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.026478883 = (MATCH) fieldWeight(content:[Arabic term] in 84), product of:
      ■ 5.0990195 = tf(termFreq(content:[Arabic term])=26)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.0048828125 = fieldNorm(field=content, doc=84)
  o 0.049879905 = (MATCH) weight(title:[Arabic term]^1.5 in 84), product of:
    ■ 0.23745762 = queryWeight(title:[Arabic term]^1.5), product of:
      ■ 1.5 = boost
      ■ 4.4812403 = idf(docFreq=3, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.21005814 = (MATCH) fieldWeight(title:[Arabic term] in 84), product of:
      ■ 1.0 = tf(termFreq(title:[Arabic term])=1)
      ■ 4.4812403 = idf(docFreq=3, numDocs=130)
      ■ 0.046875 = fieldNorm(field=title, doc=84)


Page 3:

• boost = 0.16781548
• digest = b4649130898e202ca38ef61b6b22b917
• lang = ar
• segment = 20100305180909
• title = [Arabic text] - America.gov
• tstamp = 20100305231006880
• url = http://www.america.gov/ar/econ/business.html

score for query: [Arabic term]

• 0.09217107 = (MATCH) sum of:
  o 0.091693185 = (MATCH) weight(anchor:[Arabic term]^2.0 in 16), product of:
    ■ 0.2879631 = queryWeight(anchor:[Arabic term]^2.0), product of:
      ■ 2.0 = boost
      ■ 4.075775 = idf(docFreq=5, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.31841993 = (MATCH) fieldWeight(anchor:[Arabic term] in 16), product of:
      ■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)
      ■ 4.075775 = idf(docFreq=5, numDocs=130)
      ■ 0.078125 = fieldNorm(field=anchor, doc=16)
  o 4.7789037E-4 = (MATCH) weight(content:[Arabic term] in 16), product of:
    ■ 0.03756986 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.012720046 = (MATCH) fieldWeight(content:[Arabic term] in 16), product of:
      ■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.0048828125 = fieldNorm(field=content, doc=16)

Page 4:

• boost = 0.030659562
• digest = 4304d87a1d51187c1c1d0b2b4d1597a8
• lang = ar
• segment = 20100305181031
• title = [Arabic text] - America.gov
• tstamp = 20100305231127141
• url = http://www.america.gov/ar/reviving_trade_ar.html

score for query: [Arabic term]

• 0.033118278 = (MATCH) sum of:
  o 0.018338637 = (MATCH) weight(anchor:[Arabic term]^2.0 in 104), product of:
    ■ 0.2879631 = queryWeight(anchor:[Arabic term]^2.0), product of:
      ■ 2.0 = boost
      ■ 4.075775 = idf(docFreq=5, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.06368399 = (MATCH) fieldWeight(anchor:[Arabic term] in 104), product of:
      ■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)
      ■ 4.075775 = idf(docFreq=5, numDocs=130)
      ■ 0.015625 = fieldNorm(field=anchor, doc=104)
  o 8.363082E-5 = (MATCH) weight(content:[Arabic term] in 104), product of:
    ■ 0.03756986 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.002226008 = (MATCH) fieldWeight(content:[Arabic term] in 104), product of:
      ■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 8.544922E-4 = fieldNorm(field=content, doc=104)
  o 0.014696008 = (MATCH) weight(title:[Arabic term]^1.5 in 104), product of:
    ■ 0.23745762 = queryWeight(title:[Arabic term]^1.5), product of:
      ■ 1.5 = boost
      ■ 4.4812403 = idf(docFreq=3, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.06188897 = (MATCH) fieldWeight(title:[Arabic term] in 104), product of:
      ■ 1.4142135 = tf(termFreq(title:[Arabic term])=2)
      ■ 4.4812403 = idf(docFreq=3, numDocs=130)
      ■ 0.009765625 = fieldNorm(field=title, doc=104)

Page 5:

• boost = 0.02675021
• digest = aaf055c1e690c63cf69285f8ab04f499
• lang = ar
• segment = 20100305181330
• title = [Arabic text] - America.gov
• tstamp = 20100305231345369
• url = http://www.america.gov/ar/publications/books.html#outline_economy

score for query: [Arabic term]

• 0.016127191 = (MATCH) sum of:
  o 0.016046308 = (MATCH) weight(anchor:[Arabic term]^2.0 in 80), product of:
    ■ 0.2879631 = queryWeight(anchor:[Arabic term]^2.0), product of:
      ■ 2.0 = boost
      ■ 4.075775 = idf(docFreq=5, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.05572349 = (MATCH) fieldWeight(anchor:[Arabic term] in 80), product of:
      ■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)
      ■ 4.075775 = idf(docFreq=5, numDocs=130)
      ■ 0.013671875 = fieldNorm(field=anchor, doc=80)
  o 8.088332E-5 = (MATCH) weight(content:[Arabic term] in 80), product of:
    ■ 0.03756986 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.0021528779 = (MATCH) fieldWeight(content:[Arabic term] in 80), product of:
      ■ 3.3166249 = tf(termFreq(content:[Arabic term])=11)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 6.1035156E-4 = fieldNorm(field=content, doc=80)

Page 6:

• boost = 1.0000145
• digest = 0d5b023c802941ddb358071073a98833
• lang = ar
• segment = 20100305180856
• title = [Arabic text] - [Arabic text] - America.gov
• tstamp = 20100305230902835
• url = http://www.america.gov/ar/

score for query: [Arabic term]

• 0.0030584983 = (MATCH) sum of:
  o 0.0030584983 = (MATCH) weight(content:[Arabic term] in 0), product of:
    ■ 0.03756986 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.08140829 = (MATCH) fieldWeight(content:[Arabic term] in 0), product of:
      ■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.03125 = fieldNorm(field=content, doc=0)

Page 7:

• boost = 0.23009512
• digest = 15d9ca5e7382f701cd03fb542ae3ab22
• lang = ar
• segment = 20100305180909
• title = [Arabic text] - [Arabic text] - America.gov
• tstamp = 20100305230915350
• url = http://www.america.gov/ar/multimedia/photogallery.html

score for query: [Arabic term]

• 6.6904654E-4 = (MATCH) sum of:
  o 6.6904654E-4 = (MATCH) weight(content:[Arabic term] in 38), product of:
    ■ 0.03756986 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.017808065 = (MATCH) fieldWeight(content:[Arabic term] in 38), product of:
      ■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.0068359375 = fieldNorm(field=content, doc=38)

Page 8:

• boost = 0.22996004
• digest = a0130240b4348578aa8a83e59187dfb3
• lang = ar
• segment = 20100305180909
• title = [Arabic text] - America.gov
• tstamp = 20100305231001279
• url = http://www.america.gov/ar/publications/books.html

score for query: [Arabic term]

• 6.4706657E-4 = (MATCH) sum of:
  o 6.4706657E-4 = (MATCH) weight(content:[Arabic term] in 73), product of:
    ■ 0.03756986 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.017223023 = (MATCH) fieldWeight(content:[Arabic term] in 73), product of:
      ■ 3.3166249 = tf(termFreq(content:[Arabic term])=11)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.0048828125 = fieldNorm(field=content, doc=73)
Page 9:

• boost = 0.16516872
• digest = be202e5e0f508e4291bb897eec7814dc
• lang = ar
• segment = 20100305180909
• title = [Arabic text] - 1209 - America.gov
• tstamp = 20100305230941328
• url = http://www.america.gov/ar/publications/ejournalusa/1209.html

score for query: [Arabic term]

• 5.8529375E-4 = (MATCH) sum of:
  o 5.8529375E-4 = (MATCH) weight(content:[Arabic term] in 97), product of:
    ■ 0.03756986 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.01557881 = (MATCH) fieldWeight(content:[Arabic term] in 97), product of:
      ■ 3.0 = tf(termFreq(content:[Arabic term])=9)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.0048828125 = fieldNorm(field=content, doc=97)

Page 10:

• boost = 0.19712433
• digest = c25a22a11ab6bec420c26625155ced62
• lang = ar
• segment = 20100305180909
• title = [Arabic text] - [Arabic text] - America.gov
• tstamp = 20100305230929458
• url = http://www.america.gov/ar/index.html

score for query: [Arabic term]

• 5.7346845E-4 = (MATCH) sum of:
  o 5.7346845E-4 = (MATCH) weight(content:[Arabic term] in 30), product of:
    ■ 0.03756986 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035326175 = queryNorm
    ■ 0.015264055 = (MATCH) fieldWeight(content:[Arabic term] in 30), product of:
      ■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.005859375 = fieldNorm(field=content, doc=30)
APPENDIX B


This is the detailed score for query of the top 10 pages using NutchDocumentAnalyzer.

Search Term: [Arabic term] (economy)

Page 1:

• boost = 0.22826105
• digest = e33a5de3f7d8475491bfafeff1e8b283
• lang = ar
• segment = 20100307101102
• title = [Arabic text] - America.gov
• tstamp = 20100307151153574
• url = http://www.america.gov/ar/econ.html

score for query: [Arabic term]

• 0.38501537 = (MATCH) sum of:
  o 0.1991137 = (MATCH) weight(anchor:[Arabic term]^2.0 in 16), product of:
    ■ 0.29873407 = queryWeight(anchor:[Arabic term]^2.0), product of:
      ■ 2.0 = boost
      ■ 4.2657595 = idf(docFreq=4, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.6665249 = (MATCH) fieldWeight(anchor:[Arabic term] in 16), product of:
      ■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)
      ■ 4.2657595 = idf(docFreq=4, numDocs=131)
      ■ 0.15625 = fieldNorm(field=anchor, doc=16)
  o 5.40958E-4 = (MATCH) weight(content:[Arabic term] in 16), product of:
    ■ 0.037221763 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.01453338 = (MATCH) fieldWeight(content:[Arabic term] in 16), product of:
      ■ 2.0 = tf(termFreq(content:[Arabic term])=4)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.0068359375 = fieldNorm(field=content, doc=16)
  o 0.18536073 = (MATCH) weight(title:[Arabic term]^1.5 in 16), product of:
    ■ 0.25088066 = queryWeight(title:[Arabic term]^1.5), product of:
      ■ 1.5 = boost
      ■ 4.776585 = idf(docFreq=2, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.7388403 = (MATCH) fieldWeight(title:[Arabic term] in 16), product of:
      ■ 1.4142135 = tf(termFreq(title:[Arabic term])=2)
      ■ 4.776585 = idf(docFreq=2, numDocs=131)
      ■ 0.109375 = fieldNorm(field=title, doc=16)

Page 2:

• boost = 0.16124225
• digest = 6120d6b7e6584b6a71b7d9990a68b952
• lang = ar
• segment = 20100307101102
• title = [Arabic text] - Outline of the U.S. Economy - America.gov
• tstamp = 20100307151112494
• url = http://www.america.gov/ar/publications/books/outline-of-the-us-economy.html

score for query: [Arabic term]

• 0.13663794 = (MATCH) sum of:
  o 0.07964548 = (MATCH) weight(anchor:[Arabic term]^2.0 in 85), product of:
    ■ 0.29873407 = queryWeight(anchor:[Arabic term]^2.0), product of:
      ■ 2.0 = boost
      ■ 4.2657595 = idf(docFreq=4, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.26660997 = (MATCH) fieldWeight(anchor:[Arabic term] in 85), product of:
      ■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)
      ■ 4.2657595 = idf(docFreq=4, numDocs=131)
      ■ 0.0625 = fieldNorm(field=anchor, doc=85)
  o 8.196751E-4 = (MATCH) weight(content:[Arabic term] in 85), product of:
    ■ 0.037221763 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.022021394 = (MATCH) fieldWeight(content:[Arabic term] in 85), product of:
      ■ 4.2426405 = tf(termFreq(content:[Arabic term])=18)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.0048828125 = fieldNorm(field=content, doc=85)
  o 0.056172792 = (MATCH) weight(title:[Arabic term]^1.5 in 85), product of:
    ■ 0.25088066 = queryWeight(title:[Arabic term]^1.5), product of:
      ■ 1.5 = boost
      ■ 4.776585 = idf(docFreq=2, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.22390243 = (MATCH) fieldWeight(title:[Arabic term] in 85), product of:
      ■ 1.0 = tf(termFreq(title:[Arabic term])=1)
      ■ 4.776585 = idf(docFreq=2, numDocs=131)
      ■ 0.046875 = fieldNorm(field=title, doc=85)
Page 3:

• boost = 0.16784814
• digest = 2e923befb9409e9be88aad90198063be
• lang = ar
• segment = 20100307101102
• title = [Arabic text] - America.gov
• tstamp = 20100307151205095
• url = http://www.america.gov/ar/econ/business.html

score for query: [Arabic term]

• 0.09989148 = (MATCH) sum of:
  o 0.09955685 = (MATCH) weight(anchor:[Arabic term]^2.0 in 17), product of:
    ■ 0.29873407 = queryWeight(anchor:[Arabic term]^2.0), product of:
      ■ 2.0 = boost
      ■ 4.2657595 = idf(docFreq=4, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.33326244 = (MATCH) fieldWeight(anchor:[Arabic term] in 17), product of:
      ■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)
      ■ 4.2657595 = idf(docFreq=4, numDocs=131)
      ■ 0.078125 = fieldNorm(field=anchor, doc=17)
  o 3.34631E-4 = (MATCH) weight(content:[Arabic term] in 17), product of:
    ■ 0.037221763 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.008990197 = (MATCH) fieldWeight(content:[Arabic term] in 17), product of:
      ■ 1.7320508 = tf(termFreq(content:[Arabic term])=3)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.0048828125 = fieldNorm(field=content, doc=17)

Page 4:

• boost = 0.02675021
• digest = 4eb9183dbdc405b0d40eBc92da5ed66
• lang = ar
• segment = 20100307101458
• title = [Arabic text] - America.gov
• tstamp = 20100307151513307
• url = http://www.america.gov/ar/publications/books.html#outline_economy

score for query: [Arabic term]

• 0.01747075 = (MATCH) sum of:
  o 0.017422449 = (MATCH) weight(anchor:[Arabic term]^2.0 in 81), product of:
    ■ 0.29873407 = queryWeight(anchor:[Arabic term]^2.0), product of:
      ■ 2.0 = boost
      ■ 4.2657595 = idf(docFreq=4, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.058320932 = (MATCH) fieldWeight(anchor:[Arabic term] in 81), product of:
      ■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)
      ■ 4.2657595 = idf(docFreq=4, numDocs=131)
      ■ 0.013671875 = fieldNorm(field=anchor, doc=81)
  o 4.8299826E-5 = (MATCH) weight(content:[Arabic term] in 81), product of:
    ■ 0.037221763 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.0012976233 = (MATCH) fieldWeight(content:[Arabic term] in 81), product of:
      ■ 2.0 = tf(termFreq(content:[Arabic term])=4)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 6.1035156E-4 = fieldNorm(field=content, doc=81)

Page 5:

• boost = 1.0000145
• digest = eed4dd9817b50ffda0aef158be6e4c12
• lang = ar
• segment = 20100307101052
• title = [Arabic text] - America.gov
• tstamp = 20100307151057483
• url = http://www.america.gov/ar/

score for query: [Arabic term]

• 0.002472951 = (MATCH) sum of:
  o 0.002472951 = (MATCH) weight(content:[Arabic term] in 0), product of:
    ■ 0.037221763 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.06643831 = (MATCH) fieldWeight(content:[Arabic term] in 0), product of:
      ■ 2.0 = tf(termFreq(content:[Arabic term])=4)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.03125 = fieldNorm(field=content, doc=0)

Page 6:

• boost = 0.23014276
• digest = 5f50883579dcc0aeb85ff3052764f758
• lang = ar
• segment = 20100307101102
• title = [Arabic text] - [Arabic text] - America.gov
• tstamp = 20100307151109851
• url = http://www.america.gov/ar/multimedia/photogallery.html

score for query: [Arabic term]

• 5.40958E-4 = (MATCH) sum of:
  o 5.40958E-4 = (MATCH) weight(content:[Arabic term] in 39), product of:
    ■ 0.037221763 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.01453338 = (MATCH) fieldWeight(content:[Arabic term] in 39), product of:
      ■ 2.0 = tf(termFreq(content:[Arabic term])=4)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.0068359375 = fieldNorm(field=content, doc=39)

Page 7:

• boost = 0.19715214
• digest = e84ec632a6d47466f40d6beacbcfbdf7
• lang = ar
• segment = 20100307101102
• title = [Arabic text] - [Arabic text] - America.gov
• tstamp = 20100307151123237
• url = http://www.america.gov/ar/index.html

score for query: [Arabic term]

• 4.636783E-4 = (MATCH) sum of:
  o 4.636783E-4 = (MATCH) weight(content:[Arabic term] in 31), product of:
    ■ 0.037221763 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.012457183 = (MATCH) fieldWeight(content:[Arabic term] in 31), product of:
      ■ 2.0 = tf(termFreq(content:[Arabic term])=4)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.005859375 = fieldNorm(field=content, doc=31)

Page 8:

• boost = 0.20153543
• digest = 67da72f899f80475f1ae62770921f1bd
• lang = ar
• segment = 20100307101102
• title = [Arabic text] - America.gov
• tstamp = 20100307151127260
• url = http://www.america.gov/ar/world/europe.html

score for query: [Arabic term]

• 4.636783E-4 = (MATCH) sum of:
  o 4.636783E-4 = (MATCH) weight(content:[Arabic term] in 128), product of:
    ■ 0.037221763 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.012457183 = (MATCH) fieldWeight(content:[Arabic term] in 128), product of:
      ■ 2.0 = tf(termFreq(content:[Arabic term])=4)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.005859375 = fieldNorm(field=content, doc=128)
Page 9:

• boost = 0.20091416
• digest = 9256ef74effb595d81f726d5f347898a
• lang = ar
• segment = 20100307101102
• title = [Arabic text] - America.gov
• tstamp = 20100307151148461
• url = http://www.america.gov/ar/world/mideast.html

score for query: [Arabic term]

• 4.636783E-4 = (MATCH) sum of:
  o 4.636783E-4 = (MATCH) weight(content:[Arabic term] in 129), product of:
    ■ 0.037221763 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.012457183 = (MATCH) fieldWeight(content:[Arabic term] in 129), product of:
      ■ 2.0 = tf(termFreq(content:[Arabic term])=4)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.005859375 = fieldNorm(field=content, doc=129)

Page 10:

• boost = 0.20039715
• digest = adbc4a97340b57bcc256c62131041c4c
• lang = ar
• segment = 20100307101102
• title = [Arabic text] - America.gov
• tstamp = 20100307151119356
• url = http://www.america.gov/ar/world/scasia.html

score for query: [Arabic term]

• 4.015572E-4 = (MATCH) sum of:
  o 4.015572E-4 = (MATCH) weight(content:[Arabic term] in 130), product of:
    ■ 0.037221763 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.035015345 = queryNorm
    ■ 0.010788237 = (MATCH) fieldWeight(content:[Arabic term] in 130), product of:
      ■ 1.7320508 = tf(termFreq(content:[Arabic term])=3)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.005859375 = fieldNorm(field=content, doc=130)
APPENDIX C


This is the detailed score for query of the top 10 pages using ArabicAnalyzer.

Search Term: [Arabic term] (The United States)

Page 1: 

• boost = 0.15805063 

• digest = 6b6baa67bd29d99a3ef293efeb2be3el 

• lang = ar 

• segment = 20100305181031 

• title = ^^ ^^ ■ Ameriea.gov 

. tstamp = 20100305231050378 

• url = http://www.amerioa.gov/ar/pages/footer/looal/about-us.html 

score for query: 

. 0.1196895 = (MATCH) sum of: 

o 0.059957497 = (MATCH) weight(anchor:[Arabic term]^2.0 in 68), product of:

■ 0.261373 = queryWeight(anchor:[Arabic term]^2.0), product of:

■ 2.0 = boost

■ 3.6703098 = idf(docFreq=8, numDocs=130)

■ 0.035606395 = queryNorm

■ 0.22939436 = (MATCH) fieldWeight(anchor:[Arabic term] in 68), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 3.6703098 = idf(docFreq=8, numDocs=130)

■ 0.0625 = fieldNorm(field=anchor, doc=68)

o 4.8168114E-4 = (MATCH) weight(content:[Arabic term] in 68), product of:

■ 0.037867878 = queryWeight(content:[Arabic term]), product of:

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.035606395 = queryNorm

■ 0.012720046 = (MATCH) fieldWeight(content:[Arabic term] in 68), product of:

■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.0048828125 = fieldNorm(field=content, doc=68)

o 0.059250325 = (MATCH) weight(title:[Arabic term]^1.5 in 68), product of:

■ 0.23934121 = queryWeight(title:[Arabic term]^1.5), product of:

■ 1.5 = boost

■ 4.4812403 = idf(docFreq=3, numDocs=130)

■ 0.035606395 = queryNorm

■ 0.24755588 = (MATCH) fieldWeight(title:[Arabic term] in 68), product of:

■ 1.4142135 = tf(termFreq(title:[Arabic term])=2)

■ 4.4812403 = idf(docFreq=3, numDocs=130)

■ 0.0390625 = fieldNorm(field=title, doc=68)
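
The numbers in the Page 1 listing above appear consistent with Lucene's classic TF-IDF scoring, in which each field clause is the product of a queryWeight (boost × idf × queryNorm) and a fieldWeight (tf × idf × fieldNorm), with tf = sqrt(termFreq) and idf = 1 + ln(numDocs / (docFreq + 1)). The following is a minimal sketch that recomputes the Page 1 score from the listed factors. The function names are illustrative and are not part of Nutch or Lucene. The term frequencies for the anchor and content fields (1 and 6) are assumptions inferred from the listed tf values of 1.0 and 2.4494898.

```python
import math


def tf(term_freq):
    """Term-frequency factor: square root of the raw term count."""
    return math.sqrt(term_freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def clause_score(term_freq, doc_freq, num_docs, field_norm, query_norm, boost=1.0):
    """Score of one field clause: queryWeight * fieldWeight."""
    query_weight = boost * idf(doc_freq, num_docs) * query_norm
    field_weight = tf(term_freq) * idf(doc_freq, num_docs) * field_norm
    return query_weight * field_weight


# Factors copied from the Page 1 listing (Appendix C).
QUERY_NORM = 0.035606395
NUM_DOCS = 130

# termFreq=1 and termFreq=6 are inferred from the listed tf values (assumption).
anchor = clause_score(1, 8, NUM_DOCS, 0.0625, QUERY_NORM, boost=2.0)
content = clause_score(6, 121, NUM_DOCS, 0.0048828125, QUERY_NORM)
title = clause_score(2, 3, NUM_DOCS, 0.0390625, QUERY_NORM, boost=1.5)

# The page score is the sum of the matching field clauses.
total = anchor + content + title

print(f"anchor={anchor:.6f} content={content:.6e} title={title:.6f} total={total:.6f}")
```

Lucene reports these values in single precision, so the recomputed figures should agree with the listing only to within rounding.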

Page 2: 

. boost = 0.16184442 

• digest = 0f454ab63865ae2e08003bb23896bfad 

• lang = ar 

• segment = 20100305180909 

• title = [Arabic title] - Being Muslim in America - America.gov

• tstamp = 20100305231009575

• url = http://www.america.gov/ar/publications/books-content/musliminamerica.html

score for query: 

. 0.11078926 = (MATCH) sum of: 

o 0.059957497 = (MATCH) weight(anchor:[Arabic term]^2.0 in 72), product of:

■ 0.261373 = queryWeight(anchor:[Arabic term]^2.0), product of:

■ 2.0 = boost

■ 3.6703098 = idf(docFreq=8, numDocs=130)

■ 0.035606395 = queryNorm

■ 0.22939436 = (MATCH) fieldWeight(anchor:[Arabic term] in 72), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 3.6703098 = idf(docFreq=8, numDocs=130)

■ 0.0625 = fieldNorm(field=anchor, doc=72)

o 5.5619737E-4 = (MATCH) weight(content:[Arabic term] in 72), product of:

■ 0.037867878 = queryWeight(content:[Arabic term]), product of:

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.035606395 = queryNorm

■ 0.014687842 = (MATCH) fieldWeight(content:[Arabic term] in 72), product of:

■ 2.828427 = tf(termFreq(content:[Arabic term])=8)

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.0048828125 = fieldNorm(field=content, doc=72)

o 0.050275568 = (MATCH) weight(title:[Arabic term]^1.5 in 72), product of:

■ 0.23934121 = queryWeight(title:[Arabic term]^1.5), product of:

■ 1.5 = boost

■ 4.4812403 = idf(docFreq=3, numDocs=130)

■ 0.035606395 = queryNorm

■ 0.21005814 = (MATCH) fieldWeight(title:[Arabic term] in 72), product of:

■ 1.0 = tf(termFreq(title:[Arabic term])=1)

■ 4.4812403 = idf(docFreq=3, numDocs=130)

■ 0.046875 = fieldNorm(field=title, doc=72)

Page 3:

• boost = 0.23032264

• digest = ee4a12d589e1a56e886d5b6848609391

• lang = ar

• segment = 20100305180909

• title = [Arabic title] - America.gov

• tstamp = 20100305230939904

• url = http://www.america.gov/ar/amlife.html

score for query: 

. 0.105654 = (MATCH) sum of: 

o 0.10492562 = (MATCH) weight(anchor:[Arabic term]^2.0 in 3), product of:

■ 0.261373 = queryWeight(anchor:[Arabic term]^2.0), product of:

■ 2.0 = boost

■ 3.6703098 = idf(docFreq=8, numDocs=130)

■ 0.035606395 = queryNorm

■ 0.40144014 = (MATCH) fieldWeight(anchor:[Arabic term] in 3), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 3.6703098 = idf(docFreq=8, numDocs=130)

■ 0.109375 = fieldNorm(field=anchor, doc=3)

o 7.283851E-4 = (MATCH) weight(content:[Arabic term] in 3), product of:

■ 0.037867878 = queryWeight(content:[Arabic term]), product of:

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.035606395 = queryNorm

■ 0.019234907 = (MATCH) fieldWeight(content:[Arabic term] in 3), product of:

■ 2.6457512 = tf(termFreq(content:[Arabic term])=7)

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.0068359375 = fieldNorm(field=content, doc=3)

Page 4:

• boost = 0.15872316

• digest = dcfeb490d3db633d16bfb0588d67076d

• lang = ar

• segment = 20100305180909

• title = [Arabic title] PDA - America.gov

• tstamp = 20100305230942596

• url = http://www.america.gov/ar/services/mobile.html

score for query: 

. 0.042377986 = (MATCH) sum of: 

o 4.8168114E-4 = (MATCH) weight(content:[Arabic term] in 114), product of:

■ 0.037867878 = queryWeight(content:[Arabic term]), product of:

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.035606395 = queryNorm

■ 0.012720046 = (MATCH) fieldWeight(content:[Arabic term] in 114), product of:

■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.0048828125 = fieldNorm(field=content, doc=114)

o 0.041896306 = (MATCH) weight(title:[Arabic term]^1.5 in 114), product of:

■ 0.23934121 = queryWeight(title:[Arabic term]^1.5), product of:

■ 1.5 = boost

■ 4.4812403 = idf(docFreq=3, numDocs=130)

■ 0.035606395 = queryNorm

■ 0.17504844 = (MATCH) fieldWeight(title:[Arabic term] in 114), product of:

■ 1.0 = tf(termFreq(title:[Arabic term])=1)

■ 4.4812403 = idf(docFreq=3, numDocs=130)

■ 0.0390625 = fieldNorm(field=title, doc=114)

Page 5: 

• boost = 0.04832446 

• digest = 87e8a44e7be9cb3221f6823da385f8dd 

• lang = ar 

• segment = 20100305181031 

• title = MUt.K’jj - America.gov 

. tstamp = 20100305231143209 

• url = http://www.america.gov/ar/multimedia/photogallery.html#/4110/mosques_ar/

score for query: 

• 0.022628564 = (MATCH) sum of:

o 0.02248406 = (MATCH) weight(anchor:[Arabic term]^2.0 in 50), product of:

■ 0.261373 = queryWeight(anchor:[Arabic term]^2.0), product of:

- 2.0 = boost 

- 3.6703098 = idf(docFreq=8, numDocs=130) 

■ 0.035606395 = queryNorm 

■ 0.08602288 = (MATCH) fieldWeight(anchor:[Arabic term] in 50), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 3.6703098 = idf(docFreq=8, numDocs=130) 

- 0.0234375 = fieldNorm(field=anchor, doc=50) 

o 1.4450435E-4 = (MATCH) weight(content:[Arabic term] in 50), product of:

■ 0.037867878 = queryWeight(content:[Arabic term]), product of:

- 1.0635134 = idf(docFreq=121, numDocs=130) 

- 0.035606395 = queryNorm 

■ 0.0038160137 = (MATCH) fieldWeight(content:[Arabic term] in 50), product of:

■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)

- 1.0635134 = idf(docFreq=121, numDocs=130) 

■ 0.0014648438 = fieldNorm(field=content, doc=50) 

Page 6: 

. boost = 0.033444975 

• digest = 7212084a79cd19adbfc07dc50d3e0ea4

• lang = ar 

• segment = 20100305181031 

• title = - America.gov 

. tstamp = 20100305231138931 

• url = http://www.america.gov/ar/publications/books.html#beingmuslim

score for query: 

. 0.015091554 = (MATCH) sum of: 

o 0.014989374 = (MATCH) weight(anchor:[Arabic term]^2.0 in 75), product of:

- 0.261373 = queryWeight(anchor:>(> c5j^>^ 2.0), product of: 

- 2.0 = boost 

- 3.6703098 = idf(docFreq=8, numDocs=130) 

■ 0.035606395 = queryNorm 

■ 0.05734859 = (MATCH) fieldWeight(anchor:[Arabic term] in 75), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 3.6703098 = idf(docFreq=8, numDocs=130)

■ 0.015625 = fieldNorm(field=anchor, doc=75)


o 1.0217999E-4 = (MATCH) weight(content: in 75), product of: 

- 0.037867878 = queryWeight(content:'(>c5j'^'), product of: 

■ 1.0635134 = idf(docFreq=121, numDocs=130) 

■ 0.035606395 = queryNorm 

■ 0.002698329 = (MATCH) fieldWeight(content:[Arabic term] in 75), product of:

■ 3.4641016 = tf(termFreq(content:[Arabic term])=12)

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 7.324219E-4 = fieldNorm(field=content, doc=75)

Page 7: 

• boost = 0.0420541 

. digest = 3354b6239b6eb27b9d241073f88fc34e 

• lang = ar 

• segment = 20100305181031 

• title = jj - America.gov 

. tstamp = 20100305231110244 

• url = http://www.america.gov/ar/multimedia/photogallery.html#/4110/religious_freedom_ar/

score for query: 

. 0.01136245 = (MATCH) sum of: 

o 0.01124203 = (MATCH) weight(anchor:Vt#J‘^'^2.0 in 52), product of: 

- 0.261373 = queryWeight(anchor:'(> c5j‘^>^ 2.0), product of: 

■ 2.0 = boost 

- 3.6703098 = idf(docFreq=8, numDocs=130) 

■ 0.035606395 = queryNorm

■ 0.04301144 = (MATCH) fieldWeight(anchor:[Arabic term] in 52), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 3.6703098 = idf(docFreq=8, numDocs=130)

■ 0.01171875 = fieldNorm(field=anchor, doc=52)

o 1.20420285E-4 = (MATCH) weight(content:>(>c5j^' in 52), product of: 

- 0.037867878 = queryWeight(content:>5>c5j‘^'), product of: 

■ 1.0635134 = idf(docFreq=121, numDocs=130) 

■ 0.035606395 = queryNorm 

■ 0.0031800114 = (MATCH) fieldWeight(content:[Arabic term] in 52), product of:

■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)

- 1.0635134 = idf(docFreq=121, numDocs=130) 

■ 0.0012207031 = fieldNorm(field=content, doc=52) 

Page 8: 

• boost = 0.02675021 

. digest = d4493509fble3146c2003310o9b70obd 

• lang = ar 

• segment = 20100305181330 

• title = - America.gov 

. tstamp = 20100305231409034 

• url = http://www.america.gov/ar/publications/books.html#governed

score for query: 

. 0.01132718 = (MATCH) sum of: 

o 0.01124203 = (MATCH) weight(anchor:[Arabic term]^2.0 in 77), product of:

■ 0.261373 = queryWeight(anchor:[Arabic term]^2.0), product of:

■ 2.0 = boost

■ 3.6703098 = idf(docFreq=8, numDocs=130)

■ 0.035606395 = queryNorm

■ 0.04301144 = (MATCH) fieldWeight(anchor:[Arabic term] in 77), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

- 3.6703098 = idf(docFreq=8, numDocs=130) 

■ 0.01171875 = fieldNorm(field=anchor, doc=77) 

o 8.514999E-5 = (MATCH) weight(content:'(>c5j^> in 77), product of: 

- 0.037867878 = queryWeight(content:'(>c5j'^>), product of: 

" 1.0635134 = idf(docFreq=121, numDocs=130) 

■ 0.035606395 = queryNorm 

■ 0.0022486076 = (MATCH) fieldWeight(content:[Arabic term] in 77), product of:

■ 3.4641016 = tf(termFreq(content:[Arabic term])=12)

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 6.1035156E-4 = fieldNorm(field=content, doc=77)

Page 9: 

. boost = 0.025411258 

. digest = 6b7361561b7255632af783ea69a88410 

• lang = ar 

• segment = 20100305181330 

• title = [Arabic title] - America.gov

• tstamp = 20100305231411435

• url = http://www.america.gov/ar/multimedia/photogallery.html#/4110/islam_ar/

score for query: 

. 0.011314282 = (MATCH) sum of: 

o 0.01124203 = (MATCH) weight(anchor:[Arabic term]^2.0 in 49), product of:

■ 0.261373 = queryWeight(anchor:[Arabic term]^2.0), product of:

- 2.0 = boost 

- 3.6703098 = idf(docFreq=8, numDocs=130) 

■ 0.035606395 = queryNorm 

■ 0.04301144 = (MATCH) fieldWeight(anchor:[Arabic term] in 49), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 3.6703098 = idf(docFreq=8, numDocs=130) 

■ 0.01171875 = fieldNorm(field=anchor, doc=49) 

o 7.225217E-5 = (MATCH) weight(content:[Arabic term] in 49), product of:

■ 0.037867878 = queryWeight(content:[Arabic term]), product of:

■ 1.0635134 = idf(docFreq=121, numDocs=130) 

■ 0.035606395 = queryNorm 

■ 0.0019080068 = (MATCH) fieldWeight(content:[Arabic term] in 49), product of:

■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)

■ 1.0635134 = idf(docFreq=121, numDocs=130) 

■ 7.324219E-4 = fieldNorm(field=content, doc=49) 

Page 10: 

• boost = 1.0000145 

. digest = 0d5b023c802941ddb358071073a98833 

• lang = ar 

• segment = 20100305180856 

• title = Ijl jiXs - America.gov 

. tstamp = 20100305230902835 

• url = http://www.america.gov/ar/ 

score for query:

• 0.0030827592 = (MATCH) sum of:

o 0.0030827592 = (MATCH) weight(content:[Arabic term] in 0), product of:

■ 0.037867878 = queryWeight(content:[Arabic term]), product of:

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.035606395 = queryNorm

■ 0.08140829 = (MATCH) fieldWeight(content:[Arabic term] in 0), product of:

■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.03125 = fieldNorm(field=content, doc=0)

APPENDIX D


This appendix lists the detailed query score for each of the top 10 pages returned using NutchDocumentAnalyzer.

Search Term: (America)

Page 1: 

• boost = 0.1580853 

. digest = 65d01f780ed747de9fd07241fb39df44 

• lang = ar 

• segment = 20100307101231 

• title = [Arabic title] - America.gov

• tstamp = 20100307151249455

• url = http://www.america.gov/ar/pages/footer/local/about-us.html

score for query: 

. 0.11997691 = (MATCH) sum of: 

o 0.060125146 = (MATCH) weight(anchor:[Arabic term]^2.0 in 69), product of:

- 0.26155776 = queryWeight(anchor:Vt#J'^'^2.0), product of: 

- 2.0 = boost 

■ 3.6779728 = idf(docFreq=8, numDocs=131) 

■ 0.035557326 = queryNorm 

■ 0.2298733 = (MATCH) fieldWeight(anchor:[Arabic term] in 69), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

- 3.6779728 = idf(docFreq=8, numDocs=131) 

■ 0.0625 = fieldNorm(field=anchor, doc=69) 

o 4.8056475E-4 = (MATCH) weight(content:Vt#J^' in 69), product of: 

■ 0.037797898 = queryWeight(content:>(>c5j'^'), product of: 

■ 1.063013 = idf(docFreq=122, numDocs=131)

■ 0.035557326 = queryNorm

■ 0.01271406 = (MATCH) fieldWeight(content:[Arabic term] in 69), product of:

■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)

- 1.063013 = idf(docFreq=122, numDocs=131) 

- 0.0048828125 = fieldNorm(field=content, doc=69) 
o 0.0593712 = (MATCH) weight(title:Vt#J‘^'^1.5 in 69), product of: 

■ 0.23942009 = queryWeight(title:Vt#J^'^1.5), product of: 

■ 1.5 = boost

■ 4.488903 = idf(docFreq=3, numDocs=131)

■ 0.035557326 = queryNorm

■ 0.2479792 = (MATCH) fieldWeight(title:[Arabic term] in 69), product of:

■ 1.4142135 = tf(termFreq(title:[Arabic term])=2)

- 4.488903 = idf(docFreq=3, numDocs=131) 

- 0.0390625 = fieldNorm(field=title, doc=69) 

Page 2: 

. boost = 0.16184442 

• digest = be96f39b462a546d99ebfa50ba70e710 

• lang = ar 

• segment = 20100307101102 

• title = [Arabic title] - Being Muslim in America - America.gov

• tstamp = 20100307151207698

• url = http://www.america.gov/ar/publications/books-content/musliminamerica.html

score for query: 

. 0.11105819 = (MATCH) sum of: 

o 0.060125146 = (MATCH) weight(anchor:[Arabic term]^2.0 in 73), product of:

■ 0.26155776 = queryWeight(anchor:[Arabic term]^2.0), product of:

■ 2.0 = boost

- 3.6779728 = idf(docFreq=8, numDocs=131) 

■ 0.035557326 = queryNorm 

■ 0.2298733 = (MATCH) fieldWeight(anchor:[Arabic term] in 73), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 3.6779728 = idf(docFreq=8, numDocs=131) 

■ 0.0625 = fieldNorm(field=anchor, doc=73) 

o 5.5490836E-4 = (MATCH) weight(content:Vt#J^' in 73), product of: 

■ 0.037797898 = queryWeight(content:i(>c5j'^'), product of: 

■ 1.063013 = idf(docFreq=122, numDocs=131) 

■ 0.035557326 = queryNorm 

■ 0.014680931 = (MATCH) fieldWeight(content:Vt#J‘^' in 73), 
product of: 

■ 2.828427 = tf(termFreq(content:[Arabic term])=8)

- 1.063013 = idf(docFreq=122, numDocs=131) 

- 0.0048828125 = fieldNorm(field=content, doc=73) 

o 0.050378136 = (MATCH) weight(title:Vt#J‘^'^1.5 in 73), product of: 

■ 0.23942009 = queryWeight(title:[Arabic term]^1.5), product of:

- 1.5= boost 

- 4.488903 = idf(docFreq=3, numDocs=131) 

- 0.035557326 = queryNorm 

■ 0.21041733 = (MATCH) fieldWeight(title:[Arabic term] in 73), product of:

■ 1.0 = tf(termFreq(title:[Arabic term])=1)

■ 4.488903 = idf(docFreq=3, numDocs=131) 

■ 0.046875 = fieldNorm(field=title, doc=73)

Page 3:

• boost = 0.23039404

• digest = 8ed8fed743ff1ce5d4e42db83fc549af

• lang = ar

• segment = 20100307101102

• title = [Arabic title] - America.gov

• tstamp = 20100307151136760

• url = http://www.america.gov/ar/amlife.html

score for query: 

. 0.10594571 = (MATCH) sum of: 

o 0.10521901 = (MATCH) weight(anchor:Vt#J‘^'^2.0 in 3), product of: 

■ 0.26155776 = queryWeight(anchor:Vt#J‘^'^2.0), product of: 

- 2.0 = boost 

■ 3.6779728 = idf(docFreq=8, numDocs=131) 

■ 0.035557326 = queryNorm 

- 0.40227827 = (MATCH) fieldWeight(anchor:Vt#J^' in 3), product 
of: 

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 3.6779728 = idf(docFreq=8, numDocs=131) 

■ 0.109375 = fieldNorm(field=anchor, doc=3) 

o 7.2669686E-4 = (MATCH) weight(content:Vt#J^' in 3), product of: 

- 0.037797898 = queryWeight(content:>f^j'^'), product of: 

■ 1.063013 = idf(docFreq=122, numDocs=131) 

■ 0.035557326 = queryNorm 

■ 0.019225854 = (MATCH) fieldWeight(content:[Arabic term] in 3), product of:

■ 2.6457512 = tf(termFreq(content:[Arabic term])=7)

- 1.063013 = idf(docFreq=122, numDocs=131) 

- 0.0068359375 = fieldNorm(field=content, doc=3) 

Page 4: 

. boost = 0.1587577 

• digest = a5795145f4a839ef52528d1b49e03bd1

• lang = ar

• segment = 20100307101102

• title = [Arabic title] PDA - America.gov

. tstamp = 20100307151139253 

• url = http://www.america.gov/ar/services/mobile.html 

score for query: 

. 0.042462345 = (MATCH) sum of: 

o 4.8056475E-4 = (MATCH) weight(content:Vt#J^' in 115), product of: 

- 0.037797898 = queryWeight(content:i(>c5j‘^>), product of: 

- 1.063013 = idf(docFreq=122, numDocs=131) 

■ 0.035557326 = queryNorm 

■ 0.01271406 = (MATCH) fieldWeight(content:[Arabic term] in 115), product of:

■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)

■ 1.063013 = idf(docFreq=122, numDocs=131)

■ 0.0048828125 = fieldNorm(field=content, doc=115)

o 0.04198178 = (MATCH) weight(title:Vt#J‘^'^1.5 in 115), product of: 

- 0.23942009 = queryWeight(title:Vt#J^'^1.5), product of: 

- 1.5= boost 

■ 4.488903 = idf(docFreq=3, numDocs=131)

■ 0.035557326 = queryNorm

■ 0.17534778 = (MATCH) fieldWeight(title:[Arabic term] in 115), product of:

■ 1.0 = tf(termFreq(title:[Arabic term])=1)

■ 4.488903 = idf(docFreq=3, numDocs=131)

■ 0.0390625 = fieldNorm(field=title, doc=115)

Page 5: 

• boost = 0.04832446 

• digest = 22e534f031e9e7ac8682fed4f86523e4 

• lang = ar 

• segment = 20100307101231 

• title = [Arabic title] - America.gov

• tstamp = 20100307151334977

• url = http://www.america.gov/ar/multimedia/photogallery.html#/4110/mosques_ar/

score for query: 

• 0.022691099 = (MATCH) sum of:

o 0.02254693 = (MATCH) weight(anchor:[Arabic term]^2.0 in 51), product of:

■ 0.26155776 = queryWeight(anchor:[Arabic term]^2.0), product of:

- 2.0 = boost 

■ 3.6779728 = idf(docFreq=8, numDocs=131) 

■ 0.035557326 = queryNorm 

■ 0.08620249 = (MATCH) fieldWeight(anchor:[Arabic term] in 51), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 3.6779728 = idf(docFreq=8, numDocs=131)

■ 0.0234375 = fieldNorm(field=anchor, doc=51)

o 1.4416942E-4 = (MATCH) weight(content:[Arabic term] in 51), product of:

- 0.037797898 = queryWeight(content:if^j‘^'), product of: 

- 1.063013 = idf(docFreq=122, numDocs=131) 

■ 0.035557326 = queryNorm 

■ 0.0038142179 = (MATCH) fieldWeight(content:[Arabic term] in 51), product of:

■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)

■ 1.063013 = idf(docFreq=122, numDocs=131) 

■ 0.0014648438 = fieldNorm(field=content, doc=51) 

Page 6: 

. boost = 0.033444975 

• digest = 80e97402726fad635131db1bb29555be

• lang = ar

• segment = 20100307101231

• title = [Arabic title] - America.gov

• tstamp = 20100307151330985

• url = http://www.america.gov/ar/publications/books.html#beingmuslim

score for query: 

. 0.01513323 = (MATCH) sum of: 

o 0.0150312865 = (MATCH) weight(anchor:[Arabic term]^2.0 in 76), product of:

■ 0.26155776 = queryWeight(anchor:[Arabic term]^2.0), product of:

■ 2.0 = boost

■ 3.6779728 = idf(docFreq=8, numDocs=131)

■ 0.035557326 = queryNorm

■ 0.057468325 = (MATCH) fieldWeight(anchor:[Arabic term] in 76), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 3.6779728 = idf(docFreq=8, numDocs=131)

- 0.015625 = fieldNorm(field=anchor, doc=76) 

o 1.01943166E-4 = (MATCH) weight(content:Vt#J^' in 76), product of: 

- 0.037797898 = queryWeight(content:i5>c5j‘^'), product of: 

■ 1.063013 = idf(docFreq=122, numDocs=131) 

■ 0.035557326 = queryNorm 

■ 0.0026970592 = (MATCH) fieldWeight(content:Vt#J^' in 76), 
product of: 

■ 3.4641016 = tf(termFreq(content:[Arabic term])=12)

■ 1.063013 = idf(docFreq=122, numDocs=131)

■ 7.324219E-4 = fieldNorm(field=content, doc=76)

Page 7: 

• boost = 0.0420541 

• digest = 1e1be6ad9ffbfedb82ea012b44610bed

• lang = ar

• segment = 20100307101231

• title = [Arabic title] - America.gov

• tstamp = 20100307151307072

• url = http://www.america.gov/ar/multimedia/photogallery.html#/4110/religious_freedom_ar/

score for query: 

• 0.011393607 = (MATCH) sum of:

o 0.011273465 = (MATCH) weight(anchor:[Arabic term]^2.0 in 53), product of:

■ 0.26155776 = queryWeight(anchor:[Arabic term]^2.0), product of:

■ 2.0 = boost

■ 3.6779728 = idf(docFreq=8, numDocs=131)

■ 0.035557326 = queryNorm

■ 0.043101244 = (MATCH) fieldWeight(anchor:[Arabic term] in 53), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

- 3.6779728 = idf(docFreq=8, numDocs=131) 

- 0.01171875 = fieldNorm(field=anchor, doc=53) 

o 1.2014119E-4 = (MATCH) weight(content:Vt#J^' in 53), product of: 

■ 0.037797898 = queryWeight(content:i(>c5j‘^'), product of: 

■ 1.063013 = idf(docFreq=122, numDocs=131) 

- 0.035557326 = queryNorm 

- 0.003178515 = (MATCH) fieldWeight(content:Vc5j'^' in 53), 

product of: 

■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)

■ 1.063013 = idf(docFreq=122, numDocs=131) 

■ 0.0012207031 = fieldNorm(field=content, doc=53) 

Page 8: 

• boost = 0.02675021 

• digest = df2eeaef879a60aaaddf7e8403eba7fa 

• lang = ar 

• segment = 20100307101458 

• title = [Arabic title] - America.gov

• tstamp = 20100307151541037

• url = http://www.america.gov/ar/publications/books.html#governed

score for query: 

. 0.011358418 = (MATCH) sum of: 

o 0.011273465 = (MATCH) weight(anchor:[Arabic term]^2.0 in 78), product of:

■ 0.26155776 = queryWeight(anchor:[Arabic term]^2.0), product of:

■ 2.0 = boost

■ 3.6779728 = idf(docFreq=8, numDocs=131)

■ 0.035557326 = queryNorm

■ 0.043101244 = (MATCH) fieldWeight(anchor:[Arabic term] in 78), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 3.6779728 = idf(docFreq=8, numDocs=131) 

■ 0.01171875 = fieldNorm(field=anchor, doc=78) 

o 8.495264E-5 = (MATCH) weight(content:Vt#J^' in 78), product of: 

- 0.037797898 = queryWeight(content:i5>c5j‘^'), product of: 

- 1.063013 = idf(docFreq=122, numDocs=131) 

■ 0.035557326 = queryNorm 

■ 0.0022475494 = (MATCH) fieldWeight(content:[Arabic term] in 78), product of:

■ 3.4641016 = tf(termFreq(content:[Arabic term])=12)

■ 1.063013 = idf(docFreq=122, numDocs=131) 

- 6.1035156E-4 = fieldNorm(field=content, doc=78) 

Page 9: 

. boost = 0.025411258 

• digest = 295971814b3454a9d44144054b5e194a

• lang = ar

• segment = 20100307101458

• title = [Arabic title] - America.gov

• tstamp = 20100307151543423

• url = http://www.america.gov/ar/multimedia/photogallery.html#/4110/islam_ar/

score for query: 

• 0.0113455495 = (MATCH) sum of:

o 0.011273465 = (MATCH) weight(anchor:[Arabic term]^2.0 in 50), product of:

- 0.26155776 = queryWeight(anchor:Vt#J‘^'^2.0), product of: 

■ 2.0 = boost 

■ 3.6779728 = idf(docFreq=8, numDocs=131) 

■ 0.035557326 = queryNorm 

■ 0.043101244 = (MATCH) fieldWeight(anchor:[Arabic term] in 50), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 3.6779728 = idf(docFreq=8, numDocs=131)

■ 0.01171875 = fieldNorm(field=anchor, doc=50)

o 7.208471E-5 = (MATCH) weight(content:Vt#J^' in 50), product of: 

- 0.037797898 = queryWeight(content:i5>c5j‘^'), product of: 

■ 1.063013 = idf(docFreq=122, numDocs=131) 

■ 0.035557326 = queryNorm 

- 0.0019071089 = (MATCH) fieldWeight(content:Vt#J^' in 50), 
product of: 

■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)

- 1.063013 = idf(docFreq=122, numDocs=131) 

■ 7.324219E-4 = fieldNorm(field=content, doc=50) 

Page 10: 

• boost = 1.0000145 

• digest = eed4dd9817b50ffda0aef158be6e4c12

• lang = ar

• segment = 20100307101052

• title = [Arabic title] - America.gov

. tstamp = 20100307151057483 

• url = http://www.america.gov/ar/

score for query:

. 0.0028076388 = (MATCH) sum of: 

o 0.0028076388 = (MATCH) weight(content:Vt#J^' in 0), product of: 

- 0.037797898 = queryWeight(content:if^j‘^'), product of: 

■ 1.063013 = idf(docFreq=122, numDocs=131) 

■ 0.035557326 = queryNorm 

■ 0.07428029 = (MATCH) fieldWeight(content:[Arabic term] in 0), product of:

■ 2.236068 = tf(termFreq(content:[Arabic term])=5)

■ 1.063013 = idf(docFreq=122, numDocs=131)

■ 0.03125 = fieldNorm(field=content, doc=0)

APPENDIX E


This appendix lists the detailed query score for each of the top 10 pages returned using ArabicAnalyzer.

Search Term: (Democratic)

Page 1: 

• boost = 0.16689056 

• digest = 6e1b0463970c5b60bb75636a698cf1b3

• lang = ar

• segment = 20100305180909

• title = [Arabic title] - America.gov

. tstamp = 20100305230951886 

• url = http://www.america.gov/ar/global/democracy.html 

score for query: [Arabic term]

• 0.2665834 = (MATCH) sum of:

o 0.15995954 = (MATCH) weight(anchor:[Arabic term]^2.0 in 23), product of:

■ 0.30052778 = queryWeight(anchor:[Arabic term]^2.0), product of:

■ 2.0 = boost

■ 4.2580967 = idf(docFreq=4, numDocs=130)

■ 0.035288982 = queryNorm

■ 0.5322621 = (MATCH) fieldWeight(anchor:[Arabic term] in 23), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 4.2580967 = idf(docFreq=4, numDocs=130)

■ 0.125 = fieldNorm(field=anchor, doc=23)

o 5.846775E-4 = (MATCH) weight(content:[Arabic term] in 23), product of:

■ 0.037530307 = queryWeight(content:[Arabic term]), product of:

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.035288982 = queryNorm

■ 0.01557881 = (MATCH) fieldWeight(content:[Arabic term] in 23), product of:

■ 3.0 = tf(termFreq(content:[Arabic term])=9)

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.0048828125 = fieldNorm(field=content, doc=23)

o 0.1060392 = (MATCH) weight(title:[Arabic term]^1.5 in 23), product of:

■ 0.22539584 = queryWeight(title:[Arabic term]^1.5), product of:

■ 1.5 = boost

■ 4.2580967 = idf(docFreq=4, numDocs=130)

■ 0.035288982 = queryNorm

■ 0.47045767 = (MATCH) fieldWeight(title:[Arabic term] in 23), product of:

■ 1.4142135 = tf(termFreq(title:[Arabic term])=2)

■ 4.2580967 = idf(docFreq=4, numDocs=130)

■ 0.078125 = fieldNorm(field=title, doc=23)

Page 2: 

. boost = 0.23113073 

. digest = 5285dc46473be73851750b409de012a5 

• lang = ar 

• segment = 20100305180909 

• title = y^yj>t^ - America.gov 

. tstamp = 20100305230921085 

• url = http://www.america.gov/ar/global.html 

score for query: [Arabic term]

• 0.16062789 = (MATCH) sum of:

o 0.15995954 = (MATCH) weight(anchor:[Arabic term]^2.0 in 22), product of:

■ 0.30052778 = queryWeight(anchor:[Arabic term]^2.0), product of:

■ 2.0 = boost

■ 4.2580967 = idf(docFreq=4, numDocs=130)

■ 0.035288982 = queryNorm

■ 0.5322621 = (MATCH) fieldWeight(anchor:[Arabic term] in 22), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 4.2580967 = idf(docFreq=4, numDocs=130)

■ 0.125 = fieldNorm(field=anchor, doc=22)

o 6.683421E-4 = (MATCH) weight(content:[Arabic term] in 22), product of:

■ 0.037530307 = queryWeight(content:[Arabic term]), product of:

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.035288982 = queryNorm

■ 0.017808065 = (MATCH) fieldWeight(content:[Arabic term] in 22), product of:

■ 2.4494898 = tf(termFreq(content:[Arabic term])=6)

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.0068359375 = fieldNorm(field=content, doc=22)

Page 3: 

. boost = 0.031816483 

• digest = bba906e38386b2e71f42a4f7d365e8eb 

• lang = ar 

• segment = 20100305181031 

• title = [Arabic title] - America.gov

• tstamp = 20100305231058196

• url = http://www.america.gov/ar/publications/ejournalusa/608.html

score for query: [Arabic term]

• 0.033635326 = (MATCH) sum of:

o 0.017495574 = (MATCH) weight(anchor:[Arabic term]^2.0 in 98), product of:

■ 0.30052778 = queryWeight(anchor:[Arabic term]^2.0), product of:

■ 2.0 = boost

■ 4.2580967 = idf(docFreq=4, numDocs=130)

■ 0.035288982 = queryNorm

■ 0.058216166 = (MATCH) fieldWeight(anchor:[Arabic term] in 98), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 4.2580967 = idf(docFreq=4, numDocs=130)

■ 0.013671875 = fieldNorm(field=anchor, doc=98)

o 2.33871E-4 = (MATCH) weight(content:[Arabic term] in 98), product of:

■ 0.037530307 = queryWeight(content:[Arabic term]), product of:

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.035288982 = queryNorm

■ 0.006231524 = (MATCH) fieldWeight(content:[Arabic term] in 98), product of:

■ 6.0 = tf(termFreq(content:[Arabic term])=36)

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 9.765625E-4 = fieldNorm(field=content, doc=98)

o 0.015905881 = (MATCH) weight(title:[Arabic term]^1.5 in 98), product of:

■ 0.22539584 = queryWeight(title:[Arabic term]^1.5), product of:

■ 1.5 = boost

■ 4.2580967 = idf(docFreq=4, numDocs=130)

■ 0.035288982 = queryNorm

■ 0.07056865 = (MATCH) fieldWeight(title:[Arabic term] in 98), product of:

■ 1.4142135 = tf(termFreq(title:[Arabic term])=2)

■ 4.2580967 = idf(docFreq=4, numDocs=130)

■ 0.01171875 = fieldNorm(field=title, doc=98)

Page 4:

• boost = 0.11378951

• digest = b8c157220365a4bf104be045832885be

• lang = ar

• segment = 20100305180909

• title = [Arabic title] - 0110 - America.gov

• tstamp = 20100305231013594

• url = http://www.america.gov/ar/publications/ejournalusa/0110.html

score for query: [Arabic term]

• 0.030587077 = (MATCH) sum of: 

o 5.9466175E-4 = (MATCH) weight(content:[Arabic term] in 88), product of:

■ 0.037530307 = queryWeight(content:[Arabic term]), product of:

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.035288982 = queryNorm

■ 0.01584484 = (MATCH) fieldWeight(content:[Arabic term] in 88), product of:

■ 4.358899 = tf(termFreq(content:[Arabic term])=19)

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 0.0034179688 = fieldNorm(field=content, doc=88)

o 0.029992415 = (MATCH) weight(title:[Arabic term]^1.5 in 88), product of:

■ 0.22539584 = queryWeight(title:[Arabic term]^1.5), product of:

■ 1.5 = boost

■ 4.2580967 = idf(docFreq=4, numDocs=130)

■ 0.035288982 = queryNorm

■ 0.13306552 = (MATCH) fieldWeight(title:[Arabic term] in 88), product of:

■ 1.0 = tf(termFreq(title:[Arabic term])=1)

■ 4.2580967 = idf(docFreq=4, numDocs=130)

■ 0.03125 = fieldNorm(field=title, doc=88)

Page 5: 

• boost = 0.028445216 

• digest = e04e43d37fb6f380a397373427882a1e

• lang = ar 

• segment = 20100305181151 

• title = I f ^ jjo j - America.gov 

. tstamp = 20100305231252420 

• url = http://www.america.gov/ar/democracy/globaPindex.html 

score for query: [Arabic term]

• 0.027611194 = (MATCH) sum of: 

o 0.019994942 = (MATCH) weight(anchor:-\5(>tij'3a^2.0 in 14), product of: 

■ 0.30052778 = queryWeight(anchor:-\5(>tijd=^2.0), product of: 

- 2.0 = boost 

■ 4.2580967 = idf(docFreq=4, numDocs=130) 

- 0.035288982 = queryNorm 

■ 0.06653276 = (MATCH) fieldWeight(anchor:[Arabic term] in 14), product of:

■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)

■ 4.2580967 = idf(docFreq=4, numDocs=130)

■ 0.015625 = fieldNorm(field=anchor, doc=14)

o 1.181473E-4 = (MATCH) weight(content:-:'c 5 j>tijda in 14), product of: 

- 0.037530307 = queryWeight(content:-:'c5(>tij'3a), product of: 

■ 1.0635134 = idf(docFreq=121, numDocs=130) 

■ 0.035288982 = queryNorm

■ 0.0031480505 = (MATCH) fieldWeight(content:[Arabic term] in 14), product of:

■ 3.4641016 = tf(termFreq(content:[Arabic term])=12)

■ 1.0635134 = idf(docFreq=121, numDocs=130)

■ 8.544922E-4 = fieldNorm(field=content, doc=14)

o 0.0074981037 = (MATCH) weight(title:[Arabic term]^1.5 in 14), product of:

■ 0.22539584 = queryWeight(title:[Arabic term]^1.5), product of:

■ 1.5= boost 

■ 4.2580967 = idf(docFreq=4, numDocs=130) 

- 0.035288982 = queryNorm 

■ 0.03326638 = (MATCH) fieldWeight(title:[Arabic term] in 14), product of:

■ 1.0 = tf(termFreq(title:[Arabic term])=1)

■ 4.2580967 = idf(docFreq=4, numDocs=130) 

■ 0.0078125 = fieldNorm(field=title, doc=14) 

Page 6: 

• boost = 1.0000145
• digest = 0d5b023c802941ddb358071073a98833
• lang = ar
• segment = 20100305180856
• title = [Arabic title] - America.gov
• tstamp = 20100305230902835
• url = http://www.america.gov/ar/
score for query: [Arabic term]
• 0.0021604078 = (MATCH) sum of:
  o 0.0021604078 = (MATCH) weight(content:[Arabic term] in 0), product of:
    ■ 0.037530307 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035288982 = queryNorm
    ■ 0.05756435 = (MATCH) fieldWeight(content:[Arabic term] in 0), product of:
      ■ 1.7320508 = tf(termFreq(content:[Arabic term])=3)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.03125 = fieldNorm(field=content, doc=0)
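
The tf and idf values printed in these listings are consistent with the classic Lucene similarity, in which tf is the square root of the term frequency and idf is 1 + ln(numDocs / (docFreq + 1)). The following is a minimal Python sketch that recomputes a few of the printed values. It is not part of Nutch or Lucene; the function names are illustrative, and the formulas are an assumption based on Lucene's default similarity rather than something stated in this appendix.

```python
import math


def tf(term_freq):
    # Assumed classic Lucene tf: square root of the raw term frequency.
    return math.sqrt(term_freq)


def idf(doc_freq, num_docs):
    # Assumed classic Lucene idf: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


if __name__ == "__main__":
    # Values printed in the listings above, for comparison.
    print("idf(docFreq=4, numDocs=130)   =", idf(4, 130), "(listing: 4.2580967)")
    print("idf(docFreq=121, numDocs=130) =", idf(121, 130), "(listing: 1.0635134)")
    print("tf(termFreq=12)               =", tf(12), "(listing: 3.4641016)")
    print("tf(termFreq=3)                =", tf(3), "(listing: 1.7320508)")
```

Small differences in the last digit are expected, because Lucene computes these values in single precision while Python uses double precision.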

Page 7: 

• boost = 0.22860475
• digest = 5a62ed3a20d5393ff5806fd92af1edef
• lang = ar
• segment = 20100305180909
• title = [Arabic title] - America.gov
• tstamp = 20100305230934690
• url = http://www.america.gov/ar/multimedia/podcast.html
score for query: [Arabic term]
• 6.1011006E-4 = (MATCH) sum of:
  o 6.1011006E-4 = (MATCH) weight(content:[Arabic term] in 60), product of:
    ■ 0.037530307 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035288982 = queryNorm
    ■ 0.016256463 = (MATCH) fieldWeight(content:[Arabic term] in 60), product of:
      ■ 2.236068 = tf(termFreq(content:[Arabic term])=5)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.0068359375 = fieldNorm(field=content, doc=60)

Page 8: 

• boost = 0.22996004
• digest = a0130240b4348578aa8a83e59187dfb3
• lang = ar
• segment = 20100305180909
• title = [Arabic title] - America.gov
• tstamp = 20100305231001279
• url = http://www.america.gov/ar/publications/books.html
score for query: [Arabic term]
• 5.846775E-4 = (MATCH) sum of:
  o 5.846775E-4 = (MATCH) weight(content:[Arabic term] in 73), product of:
    ■ 0.037530307 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035288982 = queryNorm
    ■ 0.01557881 = (MATCH) fieldWeight(content:[Arabic term] in 73), product of:
      ■ 3.0 = tf(termFreq(content:[Arabic term])=9)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.0048828125 = fieldNorm(field=content, doc=73)

Page 9: 

• boost = 0.23032264
• digest = ce4a12d589c1a56e886d5b6848609391
• lang = ar
• segment = 20100305180909
• title = [Arabic title] - America.gov
• tstamp = 20100305230939904
• url = http://www.america.gov/ar/amlife.html
score for query: [Arabic term]
• 5.4569903E-4 = (MATCH) sum of:
  o 5.4569903E-4 = (MATCH) weight(content:[Arabic term] in 3), product of:
    ■ 0.037530307 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035288982 = queryNorm
    ■ 0.0145402225 = (MATCH) fieldWeight(content:[Arabic term] in 3), product of:
      ■ 2.0 = tf(termFreq(content:[Arabic term])=4)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.0068359375 = fieldNorm(field=content, doc=3)

Page 10: 

• boost = 0.2296042
• digest = be9c562d0a61b335f5a8730f14412deb
• lang = ar
• segment = 20100305180909
• title = [Arabic title] - America.gov
• tstamp = 20100305230924025
• url = http://www.america.gov/ar/publications/ejournalusa.html
score for query: [Arabic term]
• 5.4010196E-4 = (MATCH) sum of:
  o 5.4010196E-4 = (MATCH) weight(content:[Arabic term] in 86), product of:
    ■ 0.037530307 = queryWeight(content:[Arabic term]), product of:
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.035288982 = queryNorm
    ■ 0.014391088 = (MATCH) fieldWeight(content:[Arabic term] in 86), product of:
      ■ 3.4641016 = tf(termFreq(content:[Arabic term])=12)
      ■ 1.0635134 = idf(docFreq=121, numDocs=130)
      ■ 0.00390625 = fieldNorm(field=content, doc=86)
APPENDIX F


This is the detailed query score for the top 10 pages using NutchDocumentAnalyzer.
Search Term: (Democratic)

Page 1:

• boost = 0.16692342
• digest = c49dl3elfa4eb518258862a27800B98
• lang = ar
• segment = 20100307101102
• title = [Arabic title] - America.gov
• tstamp = 20100307151151020
• url = http://www.america.gov/ar/global/democracy.html
score for query: [Arabic term]
• 0.29354417 = (MATCH) sum of:
  o 0.17619587 = (MATCH) weight(anchor:[Arabic term]^2.0 in 24), product of:
    ■ 0.31401145 = queryWeight(anchor:[Arabic term]^2.0), product of:
      ■ 2.0 = boost
      ■ 4.488903 = idf(docFreq=3, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.5611129 = (MATCH) fieldWeight(anchor:[Arabic term] in 24), product of:
      ■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)
      ■ 4.488903 = idf(docFreq=3, numDocs=131)
      ■ 0.125 = fieldNorm(field=anchor, doc=24)
  o 5.458426E-4 = (MATCH) weight(content:[Arabic term] in 24), product of:
    ■ 0.03718038 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.014680931 = (MATCH) fieldWeight(content:[Arabic term] in 24), product of:
      ■ 2.828427 = tf(termFreq(content:[Arabic term])=8)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.0048828125 = fieldNorm(field=content, doc=24)
  o 0.116802454 = (MATCH) weight(title:[Arabic term]^1.5 in 24), product of:
    ■ 0.23550858 = queryWeight(title:[Arabic term]^1.5), product of:
      ■ 1.5 = boost
      ■ 4.488903 = idf(docFreq=3, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.4959584 = (MATCH) fieldWeight(title:[Arabic term] in 24), product of:
      ■ 1.4142135 = tf(termFreq(title:[Arabic term])=2)
      ■ 4.488903 = idf(docFreq=3, numDocs=131)
      ■ 0.078125 = fieldNorm(field=title, doc=24)
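
The structure of the Page 1 listing can be checked by hand: each field's weight is the product of a queryWeight (boost × idf × queryNorm) and a fieldWeight (tf × idf × fieldNorm), and the page score is the sum over the matching fields. The following is a minimal Python sketch of that arithmetic using only the numbers printed above. The function name `field_score` is hypothetical, and a boost of 1.0 is assumed for the content field because the listing shows no boost line for it.

```python
def field_score(boost, idf, query_norm, tf, field_norm):
    # queryWeight and fieldWeight as they appear in the listing.
    query_weight = boost * idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight


QUERY_NORM = 0.03497641  # queryNorm printed for this query

# Inputs copied from the Page 1 listing (doc 24).
anchor = field_score(2.0, 4.488903, QUERY_NORM, 1.0, 0.125)
content = field_score(1.0, 1.063013, QUERY_NORM, 2.828427, 0.0048828125)
title = field_score(1.5, 4.488903, QUERY_NORM, 1.4142135, 0.078125)
total = anchor + content + title

if __name__ == "__main__":
    print("anchor :", anchor, "(listing: 0.17619587)")
    print("content:", content, "(listing: 5.458426E-4)")
    print("title  :", title, "(listing: 0.116802454)")
    print("total  :", total, "(listing: 0.29354417)")
```

The anchor and title fields dominate the total because of their boosts and their high idf (the term occurs in only 3 anchor and title fields out of 131 documents), while the content field contributes little because the term occurs in 122 of the 131 documents.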

Page 2: 

• boost = 0.23117816
• digest = 6f317effade06eef85513b0eb565f1b8
• lang = ar
• segment = 20100307101102
• title = [Arabic title] - America.gov
• tstamp = 20100307151115329
• url = http://www.america.gov/ar/global.html
score for query: [Arabic term]
• 0.17680001 = (MATCH) sum of:
  o 0.17619587 = (MATCH) weight(anchor:[Arabic term]^2.0 in 23), product of:
    ■ 0.31401145 = queryWeight(anchor:[Arabic term]^2.0), product of:
      ■ 2.0 = boost
      ■ 4.488903 = idf(docFreq=3, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.5611129 = (MATCH) fieldWeight(anchor:[Arabic term] in 23), product of:
      ■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)
      ■ 4.488903 = idf(docFreq=3, numDocs=131)
      ■ 0.125 = fieldNorm(field=anchor, doc=23)
  o 6.041371E-4 = (MATCH) weight(content:[Arabic term] in 23), product of:
    ■ 0.03718038 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.016248815 = (MATCH) fieldWeight(content:[Arabic term] in 23), product of:
      ■ 2.236068 = tf(termFreq(content:[Arabic term])=5)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.0068359375 = fieldNorm(field=content, doc=23)

Page 3: 

• boost = 0.11378951
• digest = ab333ad468abf764c43637fe53b7e4f7
• lang = ar
• segment = 20100307101102
• title = [Arabic title] - 0110 - America.gov
• tstamp = 20100307151214969
• url = http://www.america.gov/ar/publications/ejournalusa/0110.html
score for query: [Arabic term]
• 0.033559922 = (MATCH) sum of:
  o 5.2319805E-4 = (MATCH) weight(content:[Arabic term] in 89), product of:
    ■ 0.03718038 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.0140718855 = (MATCH) fieldWeight(content:[Arabic term] in 89), product of:
      ■ 3.8729835 = tf(termFreq(content:[Arabic term])=15)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.0034179688 = fieldNorm(field=content, doc=89)
  o 0.033036724 = (MATCH) weight(title:[Arabic term]^1.5 in 89), product of:
    ■ 0.23550858 = queryWeight(title:[Arabic term]^1.5), product of:
      ■ 1.5 = boost
      ■ 4.488903 = idf(docFreq=3, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.14027822 = (MATCH) fieldWeight(title:[Arabic term] in 89), product of:
      ■ 1.0 = tf(termFreq(title:[Arabic term])=1)
      ■ 4.488903 = idf(docFreq=3, numDocs=131)
      ■ 0.03125 = fieldNorm(field=title, doc=89)

Page 4: 

• boost = 0.028445216
• digest = 9212154ec8740ad77458648f74aa149c
• lang = ar
• segment = 20100307101343
• title = [Arabic title] - America.gov
• tstamp = 20100307151423606
• url = http://www.america.gov/ar/democracy/global/index.html
score for query: [Arabic term]
• 0.030395675 = (MATCH) sum of:
  o 0.022024484 = (MATCH) weight(anchor:[Arabic term]^2.0 in 15), product of:
    ■ 0.31401145 = queryWeight(anchor:[Arabic term]^2.0), product of:
      ■ 2.0 = boost
      ■ 4.488903 = idf(docFreq=3, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.07013911 = (MATCH) fieldWeight(anchor:[Arabic term] in 15), product of:
      ■ 1.0 = tf(termFreq(anchor:[Arabic term])=1)
      ■ 4.488903 = idf(docFreq=3, numDocs=131)
      ■ 0.015625 = fieldNorm(field=anchor, doc=15)
  o 1.12010006E-4 = (MATCH) weight(content:[Arabic term] in 15), product of:
    ■ 0.03718038 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.0030126106 = (MATCH) fieldWeight(content:[Arabic term] in 15), product of:
      ■ 3.3166249 = tf(termFreq(content:[Arabic term])=11)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 8.544922E-4 = fieldNorm(field=content, doc=15)
  o 0.008259181 = (MATCH) weight(title:[Arabic term]^1.5 in 15), product of:
    ■ 0.23550858 = queryWeight(title:[Arabic term]^1.5), product of:
      ■ 1.5 = boost
      ■ 4.488903 = idf(docFreq=3, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.035069555 = (MATCH) fieldWeight(title:[Arabic term] in 15), product of:
      ■ 1.0 = tf(termFreq(title:[Arabic term])=1)
      ■ 4.488903 = idf(docFreq=3, numDocs=131)
      ■ 0.0078125 = fieldNorm(field=title, doc=15)

Page 5: 

• boost = 1.0000145
• digest = eed4dd9817b50ffda0aef158be6e4e12
• lang = ar
• segment = 20100307101052
• title = [Arabic title] - [Arabic title] - America.gov
• tstamp = 20100307151057483
• url = http://www.america.gov/ar/
score for query: [Arabic term]
• 0.0021392573 = (MATCH) sum of:
  o 0.0021392573 = (MATCH) weight(content:[Arabic term] in 0), product of:
    ■ 0.03718038 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.05753726 = (MATCH) fieldWeight(content:[Arabic term] in 0), product of:
      ■ 1.7320508 = tf(termFreq(content:[Arabic term])=3)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.03125 = fieldNorm(field=content, doc=0)

Page 6: 

• boost = 0.22865272
• digest = dc127d214554a59575782c318462f4e8
• lang = ar
• segment = 20100307101102
• title = [Arabic title] - America.gov
• tstamp = 20100307151128611
• url = http://www.america.gov/ar/multimedia/podcast.html
score for query: [Arabic term]
• 6.041371E-4 = (MATCH) sum of:
  o 6.041371E-4 = (MATCH) weight(content:[Arabic term] in 61), product of:
    ■ 0.03718038 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.016248815 = (MATCH) fieldWeight(content:[Arabic term] in 61), product of:
      ■ 2.236068 = tf(termFreq(content:[Arabic term])=5)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.0068359375 = fieldNorm(field=content, doc=61)

Page 7: 

• boost = 0.23039404
• digest = 8ed8fcd743ff1ce5d4c42db83fc549af
• lang = ar
• segment = 20100307101102
• title = [Arabic title] - America.gov
• tstamp = 20100307151136760
• url = http://www.america.gov/ar/amlife.html
score for query: [Arabic term]
• 5.4035656E-4 = (MATCH) sum of:
  o 5.4035656E-4 = (MATCH) weight(content:[Arabic term] in 3), product of:
    ■ 0.03718038 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.01453338 = (MATCH) fieldWeight(content:[Arabic term] in 3), product of:
      ■ 2.0 = tf(termFreq(content:[Arabic term])=4)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.0068359375 = fieldNorm(field=content, doc=3)

Page 8: 

• boost = 0.22700267
• digest = e639ba79e6601f1242cee32b3ba640f4
• lang = ar
• segment = 20100307101102
• title = [Arabic title] - America.gov
• tstamp = 20100307151120668
• url = http://www.america.gov/ar/amlife/people.html
score for query: [Arabic term]
• 4.6796253E-4 = (MATCH) sum of:
  o 4.6796253E-4 = (MATCH) weight(content:[Arabic term] in 7), product of:
    ■ 0.03718038 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.012586276 = (MATCH) fieldWeight(content:[Arabic term] in 7), product of:
      ■ 1.7320508 = tf(termFreq(content:[Arabic term])=3)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.0068359375 = fieldNorm(field=content, doc=7)

Page 9: 


• boost = 0.22826105
• digest = c33a5dc3f7d8475491bfafcff1c8b283
• lang = ar
• segment = 20100307101102
• title = [Arabic title] - [Arabic title] - America.gov
• tstamp = 20100307151153574
• url = http://www.america.gov/ar/econ.html
score for query: [Arabic term]
• 4.6796253E-4 = (MATCH) sum of:
  o 4.6796253E-4 = (MATCH) weight(content:[Arabic term] in 16), product of:
    ■ 0.03718038 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.012586276 = (MATCH) fieldWeight(content:[Arabic term] in 16), product of:
      ■ 1.7320508 = tf(termFreq(content:[Arabic term])=3)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.0068359375 = fieldNorm(field=content, doc=16)

Page 10: 

• boost = 0.23110132
• digest = 3e7f5e1de4d604ef275f72043bf8efe1
• lang = ar
• segment = 20100307101102
• title = secondary Multimedia - [Arabic title] - America.gov
• tstamp = 20100307151158462
• url = http://www.america.gov/ar/multimedia.html
score for query: [Arabic term]
• 4.6796253E-4 = (MATCH) sum of:
  o 4.6796253E-4 = (MATCH) weight(content:[Arabic term] in 38), product of:
    ■ 0.03718038 = queryWeight(content:[Arabic term]), product of:
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.03497641 = queryNorm
    ■ 0.012586276 = (MATCH) fieldWeight(content:[Arabic term] in 38), product of:
      ■ 1.7320508 = tf(termFreq(content:[Arabic term])=3)
      ■ 1.063013 = idf(docFreq=122, numDocs=131)
      ■ 0.0068359375 = fieldNorm(field=content, doc=38)
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